
9. Data Cleaning: Removing Stop Words

Thus far, we have been asking questions that take stop words and grammatical features into account. For the most part, we want to exclude these features since they don’t actually contribute very much semantic content to our models. Therefore, we will:

  1. Remove capitalization and punctuation (we’ve already done this).
  2. Remove stop words.
  3. Lemmatize (or stem) our words, i.e. “jumping” and “jumps” become “jump.”

We already completed step one, and are now working with our text1_tokens. Remember, this variable, text1_tokens, contains a list of strings. We want to remove the stop words from that list.

Stop words are words that contribute very little semantic meaning and mostly serve grammatical functions. Usually, these are function words such as determiners, prepositions, and auxiliaries. The NLTK library comes with fairly comprehensive lists of stop words for many languages.

To use NLTK’s stop words, we need to import the list of words from the corpus. (We could have done this at the beginning of our program, and in more fully developed code, we would put it up there, but this works, too.) In the next cell, type:

from nltk.corpus import stopwords

We need to specify the English list, and save it into its own variable that we can use in the next step:

stops = stopwords.words('english')

Now let’s take a look at those words:

print(stops)

Now we want to go through all of the words in our text and keep only the ones that are not stop words. If a word is in the stop words list, we skip it. Otherwise, we add it to a new list. (The code below is slow, so it may take some time to process.) The way we can write this in Python is:

text1_stops = []
for t in text1_tokens:
    if t not in stops:
        text1_stops.append(t)

A more compact (and usually somewhat faster) option, if you are feeling bold, would be using a list comprehension:

text1_stops = [t for t in text1_tokens if t not in stops]

To check the result:

print(text1_stops[:30])
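
Most of the slowness mentioned above comes from the check `t not in stops`: stops is a list, so Python scans it item by item for every token. One common workaround, sketched here, is to convert the stop word list to a set first, since membership checks on a set take roughly constant time. The names sample_tokens, sample_stops, stops_set, and sample_filtered are made-up stand-ins (not the real text1_tokens or the NLTK list) so that the snippet runs on its own.

```python
# Toy stand-ins for text1_tokens and stopwords.words('english').
# These values are illustrative assumptions, not real workshop data.
sample_tokens = ["call", "me", "ishmael", "some", "years",
                 "ago", "never", "mind", "how", "long"]
sample_stops = ["me", "some", "how", "the", "a", "of"]

# Convert the list to a set once, so each membership check is fast.
stops_set = set(sample_stops)

# Same filtering logic as before, but checking against the set.
sample_filtered = [t for t in sample_tokens if t not in stops_set]

print(sample_filtered)
# ['call', 'ishmael', 'years', 'ago', 'never', 'mind', 'long']
```

In the workshop code, the equivalent change would likely be `stops = set(stopwords.words('english'))`. The filtered result should be the same, only faster to compute.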

Verifying List Contents

Now that we have removed our stop words, let’s see how many words are left in our list:

len(text1_stops)

You should get a much lower number than the length of text1_tokens.

For reference, let’s also check how many unique words there are. We will do this by making a set of words. Sets in Python work the same way as sets in math: they contain each unique word once, rather than every occurrence of every word. So, if “whale” appears 200 times in the list of words, it will only appear once in the set.

len(set(text1_stops))
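
To make the difference between the two counts concrete, here is a small sketch using a made-up list (sample_words is a hypothetical stand-in, not the real text1_stops) in which “whale” is repeated 200 times:

```python
# A made-up token list: "whale" repeated 200 times plus two other words.
sample_words = ["whale"] * 200 + ["sea", "ship"]

total_count = len(sample_words)        # counts every token, repeats included
unique_count = len(set(sample_words))  # counts each distinct word once

print(total_count)   # 202
print(unique_count)  # 3
```

The list length counts every occurrence, while the set length counts only distinct words.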

Terms Used in Lesson

Can you define the term below?

Stop Words

Words that appear frequently in a language, often adding grammatical structure, but little semantic content.
