11. Data Cleaning: Stemming Words

The code to set up a stemmer and view its output is below:

from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

The Porter stemmer is the most commonly used stemmer. Let’s see what stemming does to words and compare it with the lemmatizer:

print(porter_stemmer.stem('berry'))
print(porter_stemmer.stem('berries'))
print(wordnet_lemmatizer.lemmatize('berry'))
print(wordnet_lemmatizer.lemmatize('berries'))

The stemmer doesn’t look so good, right? But let’s check how the stemmer handles some of the words on which our lemmatizer “failed” us:

print(porter_stemmer.stem('abandon'))
print(porter_stemmer.stem('abandoned'))
print(porter_stemmer.stem('abandonedly'))
print(porter_stemmer.stem('abandonment'))

Still not perfect, but a bit better. So the question is: how do you choose between stemming and lemmatizing? As with many things in text analysis, that depends. The best approach is to experiment, look at the results, and choose the one that better fits your goals. As a general rule, stemming is faster while lemmatizing is more accurate (but not always, as we just saw). Academics usually choose the latter. Anyway, let’s stem our text with the Porter stemmer:

t1_porter = []
for t in text1_clean:
    t_stemmed = porter_stemmer.stem(t)
    t1_porter.append(t_stemmed)

Or, if we want a more compact way, using a list comprehension:

t1_porter = [porter_stemmer.stem(t) for t in text1_clean]

And let’s check the results:

print(len(set(t1_porter)))
print(sorted(set(t1_porter))[:30])

A very different list of words is produced. This list is shorter than the list produced by the lemmatizer, but it is also less accurate, and some of the stems are no longer real words (like ‘berry’ becoming ‘berri’).
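
To see exactly which word forms get merged, one option is to group the original tokens by their stem. The sketch below is only an illustration and not part of the workshop code: `sample_tokens` and `forms_by_stem` are hypothetical names, and the short token list stands in for a cleaned token list such as `text1_clean`.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

# Hypothetical sample tokens, standing in for a cleaned token list such as text1_clean.
sample_tokens = ["berry", "berries", "abandon", "abandoned", "whale", "whales"]

porter_stemmer = PorterStemmer()

# Map each stem to the distinct original forms that were collapsed into it.
forms_by_stem = defaultdict(set)
for token in sample_tokens:
    forms_by_stem[porter_stemmer.stem(token)].add(token)

# Stemming can only merge forms, so there are never more stems than distinct tokens.
print(len(set(sample_tokens)), "distinct tokens ->", len(forms_by_stem), "distinct stems")

# List every stem together with the forms that map to it.
for stem, forms in sorted(forms_by_stem.items()):
    print(stem, "<-", sorted(forms))
```

Swapping `sample_tokens` for your own token list gives a quick way to spot stems that merge words you would rather keep apart, which can help when deciding between stemming and lemmatizing.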

Terms Used in Lesson

Can you define the terms below?

Stemming

A process of collapsing words in an attempt to reduce the number of words, and get a realistic understanding of the meaning of a text. Stemming cuts off affixes to find the root (or the …
