
14. Make Your Own Corpus (continued)

Now we are going to transform that string into a text that we can perform NLTK functions on. Since we already imported nltk at the beginning of our program, we don't need to import it again; we can just use its functions by prefixing them with nltk. The first step is to tokenize the words, transforming the giant string into a list of words. A simple way to do this would be to split on spaces, and that would probably be fine, but we are going to use the NLTK tokenizer to ensure that edge cases are captured (e.g., "don't" is split into two tokens: "do" and "n't"). In the next cell, type:

don_tokens = nltk.word_tokenize(don)
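If NLTK complains about a missing resource here, the tokenizer models may need to be downloaded first (depending on your NLTK version and setup, they may already be installed):

nltk.download('punkt')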

You can check the type of don_tokens using the type() function to make sure it worked; it should be a list:
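type(don_tokens)

Now let's see how many words there are in our novel: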

len(don_tokens)

Since this is a list, we can look at any slice of it that we want. Let’s inspect the first ten words:

don_tokens[:10]

That looks like metadata, not what we want to analyze. We will strip it off before proceeding. If you were processing many texts, you would want to use Regular Expressions, an extremely powerful way to match text in a document (see the sketch below). However, since we are working with just this one text, we can either guess or cut and paste the text into a text editor and identify the position of the first word of actual content (i.e., how many words in the first word is). That is the route we are going to take. We found that the content begins at word 320.
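As an aside, here is roughly what the Regular Expression route might look like. This is a minimal sketch, and it assumes the file follows Project Gutenberg conventions, with a "*** START OF ... ***" marker before the body of the text; your file may be laid out differently:

import re

# Hypothetical: look for a Project Gutenberg-style start-of-text marker
match = re.search(r'\*\*\* START OF .+? \*\*\*', don)
if match:
    # tokenize only the text that follows the marker
    body_tokens = nltk.word_tokenize(don[match.end():])

Back to our simpler approach: let's make a slice of the text from word position 320 to the end.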

dq_text = don_tokens[320:]

Now print the first 30 words to see if it worked:

print(dq_text[:30])

Finally, if we want to use the NLTK-specific functions:

  • concordance
  • similar
  • dispersion_plot
  • or others from the NLTK book

we would have to make a specific NLTK Text object:

dq_nltk_text = nltk.Text(dq_text)

And we could check that it worked by running:

type(dq_nltk_text)
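Once we have a Text object, those functions are available as methods on it. For example, concordance prints every occurrence of a word together with its surrounding context (the search word here is just an illustration; try any word you like):

dq_nltk_text.concordance('windmills')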

But if we only need to use the built-in Python functions, we can just stick with our list of words in dq_text.
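For instance, plain list operations already go a long way (the search word is just an illustration):

len(dq_text)
dq_text.count('Dulcinea')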

Challenges for lesson 14

Assignment: Challenge

Using the dq_text variable:

  • Remove the stop words
  • Remove punctuation
  • Remove capitalization
  • Lemmatize the words

If you want to spice up your challenge, do the first three operations in a single loop using nested if statements. Google "python nested if statements" for examples.

  1. Lowercase, remove punctuation and stop words:

    from nltk.corpus import stopwords
    # nltk.download('stopwords')  # uncomment if the stop word list is not yet installed
    stops = stopwords.words('english')  # common English words to filter out

    dq_clean = []
    for word in dq_text:
        if word.isalpha():  # keep only purely alphabetic tokens, dropping punctuation and numbers
            if word.lower() not in stops:  # skip stop words
                dq_clean.append(word.lower())  # store the lowercased word
    print(dq_clean[:50])
  2. Lemmatize:

    from nltk.stem import WordNetLemmatizer
    # nltk.download('wordnet')  # uncomment if the WordNet data is not yet installed
    wordnet_lemmatizer = WordNetLemmatizer()

    dq_lemmatized = []
    for t in dq_clean:
        # lemmatize() treats each token as a noun unless a pos argument is given
        dq_lemmatized.append(wordnet_lemmatizer.lemmatize(t))
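Note that lemmatize() treats every token as a noun unless you pass a part-of-speech tag, which matters for verbs; a quick illustration:

    wordnet_lemmatizer.lemmatize('running')           # 'running' (treated as a noun)
    wordnet_lemmatizer.lemmatize('running', pos='v')  # 'run'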


Terms Used in Lesson

Can you define the terms below?

Metadata

Any data that describes your book: title, subtitle, author bio, book description, price, publication date, ISBN, etc.


Regular Expressions

A powerful way to match text in a document, with a sequence of characters that define a search pattern.

