4. Searching for Words
Let’s start by analyzing Moby Dick, which is text1 for NLTK.
The first function we will look at is concordance. “Concordance” in this context means the characters on either side of the word. Our text is behaving like one giant string, so concordance simply counts characters on either side of our target word. By default, NLTK prints up to 25 matches, each on a line 79 characters wide (including spaces and the target word itself), but you can change both settings if you want.
In the Jupyter Notebook, type text1.concordance("whale") and run the cell.
The output shows each occurrence of the word “whale” in Moby Dick together with the text on either side of it. Let’s try this with another word, “love.” Just replace the word “whale” with “love,” and we get the contexts in which Melville uses “love” in Moby Dick.
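To make the idea of "characters on either side" concrete, here is a minimal plain-Python sketch of a concordance. It is not NLTK's implementation: the function name simple_concordance, its width parameter, and the sample sentence are all invented for illustration.

```python
import re

def simple_concordance(text, word, width=25):
    """Return (left, match, right) for each whole-word occurrence of `word`.

    `width` is the number of characters of context kept on each side.
    This is an illustrative helper, not part of NLTK.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(), right))
    return rows

# A made-up sentence standing in for a full novel
sample = "The whale surfaced near the ship, and every sailor watched the whale dive again."

WIDTH = 15
for left, match, right in simple_concordance(sample, "whale", width=WIDTH):
    # Right-align the left context so the target word lines up in a column
    print(f"{left:>{WIDTH}}{match}{right}")
```

Lining the target word up in a column, as the loop does, is what makes a concordance easy to scan: the eye can run down the middle and compare what comes before and after each occurrence.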
concordance is used (behind the scenes) for several other functions, including similar, which we turn to next.
Let’s now see which words appear in similar contexts as the word “love.” NLTK has a built-in function for this as well, called similar. In the next cell, type text1.similar("love").
Behind the scenes, Python finds every context in which the word “love” appears, then looks for other words that occur in those same contexts. The result gives a sense of which words are used the way “love” is. This is somewhat interesting in itself, but more interesting if we compare it to something else. Let’s take a look at another text. What about Sense and Sensibility (text2)? Let’s see what words are similar to “love” in Jane Austen’s writing. In the next cell, type text2.similar("love").
We can compare the two and see immediately that Melville and Austen use the word “love” differently.
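The notion of "words that appear in similar contexts" can be sketched in plain Python. The version below is a simplification under stated assumptions, not NLTK's actual code: it treats a word's context as the pair (previous word, next word) and ranks other words by how many such pairs they share with the target. The function name similar_words and the toy sentences are invented for illustration.

```python
from collections import Counter, defaultdict

def similar_words(tokens, word, num=5):
    """Rank words that share (previous word, next word) contexts with `word`.

    Illustrative helper only; NLTK's own ranking may differ.
    """
    tokens = [t.lower() for t in tokens]
    word = word.lower()

    # Map each word to the set of (previous, next) pairs it appears between.
    # The first and last tokens are skipped because they lack a neighbor.
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    target_contexts = contexts.get(word, set())

    # Score every other word by the number of contexts it shares with the target
    scores = Counter()
    for other, ctxs in contexts.items():
        if other != word:
            shared = len(ctxs & target_contexts)
            if shared:
                scores[other] = shared
    return [w for w, _ in scores.most_common(num)]

# Toy data: "hate" and "fear" each occur in a slot where "love" also occurs
tokens = "i love my ship . i hate my ship . i love the sea . i fear the sea .".split()
print(similar_words(tokens, "love"))
```

On the toy data, "hate" and "fear" are returned because each fills a slot that "love" also fills ("i ___ my" and "i ___ the"). Running the same function over two different token lists is one way to see why the results for Melville and Austen diverge: the slots that "love" occupies differ from one author to the other.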
Let’s expand from novels for a minute and take a look at the NLTK Chat Corpus. In chats, text messages, and other digital communication platforms, “lol” is exceedingly common. We know it doesn’t simply mean “laughing out loud,” so maybe the similar function can provide some insight into what it does mean. The Chat Corpus is text5 in NLTK, so type text5.similar("lol").
The resulting list consists largely of greetings, which suggests that “lol” has more of a phatic function. Phatic language is language primarily for communicating social closeness. Phatic words stand in contrast to semantic words, which contribute meaning to the utterance.
Terms Used in Lesson
Can you define the terms below?
concordance: An NLTK function that shows the characters on both sides of a word; an easy way to investigate the context of a given word across a corpus.
phatic language: Language primarily for communicating social closeness. Phatic words stand in contrast to semantic words, which contribute meaning to the utterance.