
Text Analysis with Python and NLTK

Digital technologies have made vast amounts of text available to researchers, and this same technological moment has provided us with the capacity to analyze that text faster than humanly possible. The first step in that analysis is to transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural Language Toolkit (commonly called NLTK), this workshop introduces strategies to turn qualitative texts into quantitative objects. Through that process, we will present a variety of strategies for simple analysis of text-based data.

In this workshop, you will learn to:

  • Compare frequency distributions of words in a text to quantify the narrative arc
  • Clean and standardize your data using powerful tools such as stemmers and lemmatizers
  • Prepare texts for computational analysis, including strategies for transforming texts into numbers
  • Tokenize your data and put it in a format compatible with the Natural Language Toolkit
  • Use NLTK methods such as concordance and similar
  • Transform any document that you have (or have access to) in .txt format into a text that can be analyzed computationally
  • Understand stop words and how to remove them when needed
  • Use Part-of-Speech tagging to gather insights about a text (see the sketch after this list for a preview of several of these steps)
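
To give a sense of what several of these steps look like in practice, here is a minimal sketch. It assumes NLTK is installed along with its punkt, stopwords, and averaged_perceptron_tagger data packages (see the installation notes below), and sample.txt is a placeholder for any plain-text file you might want to analyze:

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Read a plain-text file (sample.txt is a placeholder filename)
    with open("sample.txt", encoding="utf-8") as f:
        raw = f.read()

    # Tokenize the raw string into a list of word tokens
    tokens = word_tokenize(raw.lower())

    # Remove punctuation-only tokens and English stop words
    stops = set(stopwords.words("english"))
    words = [t for t in tokens if t.isalpha() and t not in stops]

    # Wrap the tokens in an NLTK Text object to use methods like concordance() and similar()
    text = nltk.Text(words)
    text.concordance("love")

    # Build a frequency distribution and inspect the most common words
    freq = nltk.FreqDist(words)
    print(freq.most_common(10))

    # Tag a few tokens with their parts of speech
    print(nltk.pos_tag(words[:20]))

The workshop walks through each of these steps, and the stemmers and lemmatizers mentioned above, in much more detail.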

Before you get started

In this section, we introduce some important steps to take before you begin this workshop: suggested workshops to complete first, required or recommended software installations, and files to download from external sources.

Workshops

This is a list of workshops that we suggest you complete before starting this one. They are listed here because they cover central concepts or tools you may need in order to digest all the information presented in this workshop.

Introduction to Python: Required

This workshop relies heavily on concepts from the Introduction to Python workshop, and a basic understanding of the commands discussed there will be central for anyone who wants to learn about text analysis with Python and NLTK.

Quickstart

Software installations

Some software is required for you to participate in this workshop, and some is recommended. Below is a list of the prerequisite installations, a link to installation instructions for each (for your operating system, where available), and an indication of whether each one is required.

Python (and Anaconda): Required
This workshop uses Python, so you will need a working Python installation. If you choose to install a different version of Python, make sure it is version 3, as other versions will not work with this workshop.
Natural Language Toolkit: Required
You will need to install the NLTK package into your Python environment for the purposes of this workshop. This guide will help you along the way.
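
As a quick orientation, the lines below show one common way to install NLTK and download the data packages this workshop draws on. This is a sketch that assumes a Python 3 installation with pip (or Anaconda) available; the guide linked above remains the authoritative set of instructions:

    # In a terminal, install the package first (Anaconda users can run
    # "conda install nltk" instead):
    #     pip install nltk

    # Then, from within Python, download the data collections used here.
    import nltk

    nltk.download("punkt")                       # tokenizer models
    nltk.download("stopwords")                   # stop word lists
    nltk.download("averaged_perceptron_tagger")  # part-of-speech tagger
    nltk.download("book")                        # sample texts used in many NLTK examples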

Insights

This is a list of recommended and required "insights" that you may want to engage with. Insights can be considered "mini-workshops" about a particular, smaller topic or task that you may need to complete. In this list, you will find Insights that we have deemed important or interesting for you and that are related to this workshop.

Contexts

Why am I learning this? Why does it matter? How will it help my project? Learning new digital skills is an investment of your valuable time, so it is reasonable to want to know—essentially—what will I get out of taking this workshop? The materials below help situate the skills you are about to learn within a larger context of how they are used, by whom, and to what ends.

Ethical considerations

Digital tools and the skills required to use them are part of our culture and, therefore, never neutral. Digital humanists and social scientists consider the ethical challenges and responsibilities of the tools and methods that they use. The following materials are designed to introduce you to issues you may want to consider as you learn this new skill and decide how to integrate it into your own research and teaching.

When working with massive amounts of text, it is easy to lose the original context. We must be aware of this and be careful when analyzing the results.

It is important to constantly question our assumptions and the measures we are using. Numbers and graphs do not tell the story; our analysis does. We must be careful not to draw hasty and simplistic conclusions about things that are complex. Just because we find that author A uses more unique words than author B, does that mean that A is a better writer than B?
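
As an illustration of how easily a single number can be over-interpreted, here is a minimal sketch that computes lexical diversity, the ratio of unique words to total words, for two texts (author_a.txt and author_b.txt are placeholder filenames, and NLTK's punkt tokenizer data is assumed to be installed). The resulting ratio says nothing, by itself, about which author is the "better" writer:

    from nltk.tokenize import word_tokenize

    def lexical_diversity(path):
        """Return the ratio of unique word tokens to total word tokens in a file."""
        with open(path, encoding="utf-8") as f:
            tokens = [t.lower() for t in word_tokenize(f.read()) if t.isalpha()]
        return len(set(tokens)) / len(tokens)

    # author_a.txt and author_b.txt are placeholder filenames
    for path in ["author_a.txt", "author_b.txt"]:
        print(path, round(lexical_diversity(path), 3))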

Readings before you get started

The readings listed below situate what you are about to learn in cultural contexts, such as a particular humanities or social science field, the information or computer sciences, or popular discourse. The purpose of the readings is to provide a theoretical framework you can use to contextualize how you intend to use the skill or tool introduced in this workshop.

Projects related to Text Analysis with Python and NLTK

The following are sample projects that use the skill or tool (either implicitly or explicitly) that you are about to learn. Some skills that are foundational may seem not to lead to a specific project goal that you have in mind. You might be surprised to learn that the following projects depend on the skills learned in this workshop.

Cheat sheets related to Text Analysis with Python and NLTK

For an introduction to what cheat sheets are, what they do, and why we have them on the site, see our frontmatter section.

Meet your instructor

I am a Ph.D. student in the History department, specializing in the History of Capitalism and the history of Latin America. I research the role of transnational capital in Latin American urban development. As a Digital Fellow with the Graduate Center Digital Initiatives, I support members of the GC community in finding the right computing tools for their research. I lead the Python Users’ Group and write and lead workshops on Python, Markdown, and Network Analysis, among others. I am interested in programming, video games, digital tools, teaching research computing, Emacs, running, and board games, among other things.
