
8. Lexical Density

Now we can calculate the lexical density: the number of unique words divided by the total number of words. Statistical studies have shown that lexical density is a good metric for approximating lexical diversity, the range of vocabulary an author uses. For our first pass at lexical density, we will simply divide the number of unique words by the total number of words:

len(set(text1_tokens)) / len(text1_tokens)
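To see what this ratio measures, here is a minimal sketch on a tiny hand-made token list. The list `sample_tokens` is made up for illustration and is not part of the workshop data.

```python
# A made-up token list for illustration (not from the workshop corpus).
sample_tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

unique_count = len(set(sample_tokens))  # distinct words: the, whale, and, sea, ship
total_count = len(sample_tokens)        # all words, repeats included

sample_density = unique_count / total_count
print(unique_count, total_count, sample_density)  # 5 8 0.625
```

Five distinct words out of eight gives 0.625. A text that never repeated a word would score 1.0, and heavier repetition pushes the score toward 0.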

If we want to use this metric to compare texts, we immediately notice a problem. Lexical density depends on the length of a text, so it is strictly a comparative measure: a value only means something next to another value computed on a sample of the same size. It is possible to compare 100 words from one text to 100 words from another, but because vocabulary is finite and writers repeat words, longer samples contain proportionally fewer new words, so it is not meaningful to compare 100 words from one text to 200 words from another. Even with these restrictions, lexical density is a useful metric in grade-level estimation, vocabulary-use analysis, and genre classification, and a reasonable proxy for lexical diversity.

Let’s take this constraint into account by working with only the first 10,000 words of our text. First we need to slice our list, returning the words in positions 0 through 9,999 (we’ll actually write it as “up to, but not including” 10,000):

text1_slice = text1_tokens[0:10000]
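The "up to, but not including" behavior of a slice is easier to see on a short list. The sketch below uses a made-up list called `letters`. It also shows one thing worth keeping in mind: if a list is shorter than the requested end position, Python returns whatever is there without raising an error, so a text under 10,000 words would silently give a shorter sample.

```python
# A made-up list for illustration.
letters = ["a", "b", "c", "d", "e"]

first_three = letters[0:3]   # positions 0, 1 and 2; position 3 is excluded
print(first_three)           # ['a', 'b', 'c']

# Slicing past the end does not fail; it returns the whole list.
oversized = letters[0:10000]
print(len(oversized))        # 5
```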

Now we can do the same calculation we did above:

len(set(text1_slice)) / len(text1_slice)

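This is a much higher number than the one for the full text, because a shorter sample has had less room to repeat words. The number itself has no meaning in isolation; it is only useful when set against the density of another sample of the same length. When comparing different texts, this slicing step is essential for a fair comparison.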
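The effect of length on the score can be illustrated with an artificial example. The sketch below assumes a toy "text" built by cycling through a 50-word vocabulary; the names `vocabulary`, `toy_tokens`, and `density` are hypothetical and chosen only for this illustration. Real texts behave less regularly, but the downward trend is the same.

```python
# Hypothetical toy "text": a 50-word vocabulary repeated until we have 1,000 tokens.
vocabulary = ["word" + str(i) for i in range(50)]
toy_tokens = [vocabulary[i % 50] for i in range(1000)]

def density(tokens):
    """Return unique words divided by total words."""
    return len(set(tokens)) / len(tokens)

# The same text scores lower and lower as the sample grows.
for size in (50, 100, 500, 1000):
    print(size, density(toy_tokens[:size]))
# 50 1.0
# 100 0.5
# 500 0.1
# 1000 0.05
```

Nothing about the writing changes between these samples, only the sample size, which is why two texts should be measured at the same length.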

Challenges for lesson 8

Assignment: Challenge

Let’s compare the lexical density of Moby Dick with Sense and Sensibility. Make sure to:

  1. Make all the words lowercase and remove punctuation.
  2. Make a slice of the first 10,000 words.
  3. Calculate lexical density by dividing the length of the set of the slice by the length of the slice.

Remember to be aware of the ethical implications of the conclusions that we might draw from our data. What assumptions might we be reifying about these writers?

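# Keep only alphabetic tokens (this drops punctuation) and lowercase them.
text2_tokens = []
for t in text2:
    if t.isalpha():
        t = t.lower()
        text2_tokens.append(t)

# Take the first 10,000 words so both texts are compared at the same length.
text2_slice = text2_tokens[0:10000]

# Unique words divided by total words in the slice.
len(set(text2_slice)) / len(text2_slice)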
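Because the same three steps are applied to each text, one possible way to keep the comparison consistent is to wrap them in a small function. This is only a sketch: the function name `lexical_density`, its `sample_size` parameter, and the two toy token lists are assumptions made for illustration, standing in for `text1` and `text2`.

```python
def lexical_density(raw_tokens, sample_size=10000):
    """Lowercase the alphabetic tokens, keep the first sample_size of them,
    and return unique words divided by total words."""
    cleaned = [t.lower() for t in raw_tokens if t.isalpha()]
    sample = cleaned[:sample_size]
    if not sample:
        # Avoid dividing by zero when nothing survives the cleaning step.
        raise ValueError("no alphabetic tokens to measure")
    return len(set(sample)) / len(sample)

# Made-up stand-ins for text1 and text2 (not real corpus data).
toy_text_a = ["Call", "me", "Ishmael", ".", "Call", "me", "!"]
toy_text_b = ["The", "family", "of", "Dashwood", ",", "the", "family"]

# Use the same sample size for both so the scores are comparable.
print(lexical_density(toy_text_a, sample_size=4))  # 0.75
print(lexical_density(toy_text_b, sample_size=4))  # 1.0
```

With the workshop data, the calls would presumably be `lexical_density(text1)` and `lexical_density(text2)`, relying on the default sample size of 10,000.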

