
15. Part-of-Speech Tagging

Note that we are going to use the pre-cleaned dq_text object for this section. Part-of-speech (POS) tagging means going through a text and identifying which part of speech each word belongs to (e.g., noun, verb, or adjective). Every word belongs to a part of speech, but some words can be confusing:

  • Floyd is happy.
  • Happy is a state of being.
  • Happy has five letters.
  • I’m going to Happy Cat tonight.

Therefore, a word’s part of speech depends as much on its relationship to the words around it as on the word itself. A good part-of-speech tagger takes this into account, but there are some impossible cases as well:

  • Wanda was entertaining last night.

Part-of-speech tagging can be done very simply, with a very small tag set, or in a very complex way, with a much more elaborate tag set. We are going to compromise and use a tag set that is neither small nor large: the Penn Treebank POS tag set. This is the tag set that is pre-loaded into NLTK. When we call the tagger, we expect it to return an object with each word and its associated tag. Because POS tagging depends on the stop words, we have to use a text that includes them. Therefore, we will go back to using the dq_text object for this section. Let’s try it out. Type:

dq_tagged = nltk.pos_tag(dq_text)

Let’s inspect what we have:

print(dq_tagged[:10])
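
The printed value is a list of (word, tag) pairs. To make that shape concrete, here is a minimal sketch that uses a small hand-written list in place of dq_tagged. The name sample_tagged and its contents are assumptions made for illustration, not real tagger output:

```python
# Hand-written sample in the same (word, tag) shape; not real tagger output.
sample_tagged = [("Floyd", "NNP"), ("is", "VBZ"), ("happy", "JJ"), (".", ".")]

# Each element is a tuple, so it can be unpacked into two names.
first_word, first_tag = sample_tagged[0]
print(first_word, first_tag)

# Tuples cannot be changed after creation; item assignment raises TypeError.
try:
    sample_tagged[0][1] = "NN"
except TypeError as error:
    print("Tuples are immutable:", error)

# The surrounding list is still mutable, so a whole pair can be replaced.
sample_tagged[0] = ("Floyd", "NN")
```

Unpacking a pair into two names is the same pattern that a loop of the form "for (word, tag) in ..." relies on.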

This is a list of tuples. Python’s native tuple type is similar to a list, but a tuple can’t be changed once it is created. Tuples are also written with parentheses rather than square brackets. Each element in the list is a tuple, or pairing, consisting of (word, POS-tag). This is great, but it is very detailed. I would like to know how many nouns, verbs, and adjectives I have. First, I’ll make an empty dictionary to hold my results. (If you don’t know what a dictionary is and how it works, you can check a quick explanation here.) After that, I will go through this list of tuples and count the number of times each tag appears. Every time I encounter a new tag, I’ll add it to the dictionary, and then I’ll increment its count by one every time I encounter that tag again. Let’s see what that looks like in code:

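# Start with an empty dictionary that maps each tag to its count.
tag_dict = {}
# For every word/tag pair in my list,
for (word, tag) in dq_tagged:
    if tag in tag_dict:
        # we have seen this tag before, so add one to its count;
        tag_dict[tag] += 1
    else:
        # otherwise this is a new tag, so start its count at one.
        tag_dict[tag] = 1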

Now let’s see what we got:

tag_dict
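
The counting loop above follows a very common pattern, and Python’s standard library offers collections.Counter for it. The following is a minimal sketch of that alternative, not a replacement for the lesson’s approach. It uses a small hand-written list called sample_tagged (an assumption for illustration, not tagger output) in place of dq_tagged:

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for dq_tagged.
sample_tagged = [
    ("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]

# Counter tallies every tag it is given; the word in each pair is ignored.
tag_counts = Counter(tag for _word, tag in sample_tagged)

print(tag_counts["NN"])  # number of times "NN" appears
print(tag_counts["JJ"])  # a tag that never appears counts as zero
```

A Counter behaves like a dictionary, so it can be inspected and sorted in the same way as tag_dict.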

This would be better with some order to it, but dictionaries are not sorted by their values. A quick search for “sort dictionaries python” turns up a solution on Stack Overflow. Even though we cannot sort a dictionary in place, we can get a sorted representation of its contents. Don’t worry too much about understanding the following code, as it uses functions and methods that we have not discussed and that are outside the scope of this course. It is useful to learn to reuse pieces of code even when we don’t fully understand them. Now let’s do it and find out what the most common tag is.

tag_dict_sorted = sorted(tag_dict.items(),
                         reverse=True,
                         key=lambda kv: kv[1])

Now let’s check out what we have:

print(tag_dict_sorted)
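
For readers curious about what the sorted call is doing, the following sketch spells out the same steps with a named function in place of the lambda. The counts in tag_counts are made up for illustration, and count_of is a helper name introduced here, not part of the lesson:

```python
# Made-up counts standing in for tag_dict.
tag_counts = {"DT": 2, "NN": 5, "VBD": 1, "IN": 3}

# .items() yields (tag, count) pairs, one per dictionary entry.
pairs = list(tag_counts.items())

def count_of(pair):
    """Return the count, which is the second item of a (tag, count) pair."""
    return pair[1]

# Sort the pairs by count; reverse=True puts the largest count first.
by_count = sorted(pairs, key=count_of, reverse=True)
print(by_count)

# The first pair in the sorted list holds the most common tag.
most_common_tag, most_common_count = by_count[0]
print(most_common_tag, most_common_count)
```

The lambda kv: kv[1] in the lesson’s code plays the same role as count_of: it tells sorted to compare the pairs by their second item.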

Your result should show that NN is the most common tag. We can look up what NN means in the Penn Treebank tag set: NN is “Noun, singular or mass.” Great! This kind of information can help with tasks such as genre classification or identifying the author of a text, among other uses.
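
The tag counts are more fine-grained than the earlier question of how many nouns, verbs, and adjectives there are. In the Penn Treebank tag set, the noun tags all begin with NN, the verb tags with VB, and the adjective tags with JJ, so one possible way to get coarse totals is to group tags by prefix. This is a sketch under that assumption. The function coarse_category and the counts in sample_counts are hypothetical names and made-up data, not part of the lesson:

```python
def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category by its prefix."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    return "other"

# Made-up tag counts standing in for tag_dict.
sample_counts = {"NN": 5, "NNS": 2, "NNP": 1, "VBD": 3, "VBZ": 1, "JJ": 2, "DT": 4}

# Add each tag's count to the total for its coarse category.
coarse_counts = {}
for tag, count in sample_counts.items():
    category = coarse_category(tag)
    coarse_counts[category] = coarse_counts.get(category, 0) + count

print(coarse_counts)
```

Replacing sample_counts with tag_dict would apply the same grouping to the counts built in this lesson.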

Terms Used in Lesson

Can you define the terms below?

Dictionaries

A collection of key-value pairs that is changeable and indexed by key. Since Python 3.7, dictionaries keep their items in insertion order, but they are not sorted. In Python, dictionaries are written with curly brackets.

part-of-speech (POS) tagging

A way to identify the grammatical category of each word in a given text. For a list of POS classifications, see the Penn Treebank tag set.

Tuples

A collection which is ordered and unchangeable. In Python, tuples are written with parentheses.
