
12. Data Cleaning: Results

Now that we’ve seen some of the differences between the two approaches, we will proceed using our lemmatized corpus, which we saved as text1_clean:

my_dist = FreqDist(text1_clean)

If nothing appears to happen, that is normal. Check that the object exists by asking for the type of the “my_dist” object:

type(my_dist)


The result should say that it is an NLTK frequency distribution (nltk.probability.FreqDist). It doesn’t matter too much right now what that is, only that it worked. We can now plot it with the object’s plot method, which draws the chart using matplotlib. We want to plot the 20 most frequent entries of the my_dist object:

my_dist.plot(20)


[Figure: NLTK frequency distribution plot]

We’ve made a nice image here, but it might be easier to comprehend as a list. Because this is a special probability distribution object, we can also call the most_common method on it. Let’s find the twenty most common words:

my_dist.most_common(20)
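
To make the shape of such a result concrete, here is a minimal sketch on a made-up token list. It relies on the fact that FreqDist inherits from Python's collections.Counter, so the standard-library Counter stands in for it here. The names sample_tokens, sample_dist, and top_two are hypothetical and not part of the workshop.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for text1_clean.
sample_tokens = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Counter stands in for nltk's FreqDist, which subclasses it.
sample_dist = Counter(sample_tokens)

# most_common(n) returns a list of (word, count) tuples, highest count first.
top_two = sample_dist.most_common(2)
print(top_two)  # [('whale', 3), ('sea', 2)]
```

With the real corpus, each tuple pairs a word from text1_clean with the number of times it occurs.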


What if we are interested in a list of specific words, perhaps to identify texts that contain biblical references? Let’s make a (short) list of words that might suggest a biblical reference and see if they appear in Moby Dick. Set this list equal to a variable:

b_words = ['god', 'apostle', 'angel']

Then we will loop through our list of biblical words and check whether each one appears in our cleaned corpus. We’ll save just those words that appear in both into another list.

my_list = []
for word in b_words:
    if word in text1_clean:
        my_list.append(word)

And then we will print the results.

print(my_list)


You can do this with much larger lists, and even compare entire novels if you wish, though this approach would be slow because every “in” check scans the whole corpus list. The same idea, counting how many words two texts share, is the basis of simple similarity measures and can help answer related questions.
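
As a rough sketch of one such measure, the example below computes Jaccard similarity (shared vocabulary divided by combined vocabulary) for two made-up token lists. Jaccard is only one common choice and is not necessarily the measure this workshop has in mind. The names text_a, text_b, vocab_a, vocab_b, shared, and jaccard are all hypothetical.

```python
# Hypothetical token lists standing in for two cleaned texts.
text_a = ["god", "whale", "sea", "angel", "ship", "whale"]
text_b = ["god", "garden", "angel", "apostle", "sea"]

# Converting to sets removes duplicates and makes membership checks fast.
vocab_a = set(text_a)
vocab_b = set(text_b)

# Words that occur in both texts.
shared = vocab_a & vocab_b

# Jaccard similarity: size of the overlap divided by size of the union.
jaccard = len(shared) / len(vocab_a | vocab_b)

print(sorted(shared))     # ['angel', 'god', 'sea']
print(round(jaccard, 2))  # 0.43
```

Working with sets also avoids the repeated list scans of the loop approach, which matters once the texts are the length of a novel.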

Challenges for lesson 12

Assignment: Challenge

  1. Try to get the same result of the loop above (the one with “my_list”), but this time with a list comprehension. Save this other list as “my_list2”.

  2. Compare both lists to see if they are identical.

  1. A solution using a list comprehension would look like this:

    my_list2 = [word for word in b_words if word in text1_clean]

  2. To compare the lists, you could run the following command:

    my_list == my_list2
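
To try both versions without loading the full corpus, the following sketch runs the loop and the list comprehension side by side on a small made-up token list. Here sample_clean is a hypothetical stand-in for text1_clean; the other names follow the lesson.

```python
# Hypothetical stand-in for the workshop's text1_clean.
sample_clean = ["call", "me", "ishmael", "god", "angel", "whale", "god"]
b_words = ["god", "apostle", "angel"]

# Loop version: collect each biblical word found in the corpus.
my_list = []
for word in b_words:
    if word in sample_clean:
        my_list.append(word)

# List comprehension version of the same filter.
my_list2 = [word for word in b_words if word in sample_clean]

print(my_list)              # ['god', 'angel']
print(my_list == my_list2)  # True
```

Because both versions walk through b_words in the same order, the two lists match element for element, so the == comparison returns True.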



