PyCon 2016 in Portland, OR

Saturday 9 a.m.–12:20 p.m.

Machine Learning with Text in scikit-learn

Kevin Markham

Audience level:


Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.


It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on... In this tutorial, we'll answer all of those questions, and more!

We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into Pandas and immediately apply what we've learned about vectorization. We'll use a couple of NumPy tricks to gain insights from our document-term matrix, and then move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model in a few different ways, and then practice the entire workflow again on a separate dataset of Yelp reviews. Finally, we'll discuss which parts of the process are worth tuning for improved performance.

Attendees of this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both Pandas and scikit-learn. No knowledge of advanced mathematics is required. Attendees will need to bring a laptop with scikit-learn and Pandas (and their dependencies) already installed. Installing the Anaconda distribution of Python is an easy way to accomplish this. Both Python 2 and 3 are acceptable.
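As a rough preview of the questions above about "fit" versus "transform" and the document-term matrix, here is a minimal sketch. It is not taken from the tutorial materials: the toy messages, labels, and variable names are made up for illustration, and the choice of CountVectorizer with MultinomialNB is an assumption, since the description does not say which vectorizer or model the tutorial settles on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (1 = spam, 0 = not spam); not one of the tutorial's datasets.
train_texts = [
    "win cash now",
    "cheap cash prize",
    "lunch at noon",
    "see you at lunch",
]
train_labels = [1, 1, 0, 0]
test_texts = ["win a cash prize", "lunch tomorrow"]

vect = CountVectorizer()

# "fit" learns the vocabulary from the training text;
# "transform" uses that vocabulary to build a document-term matrix.
# fit_transform does both in one step on the training data.
X_train = vect.fit_transform(train_texts)

# The test text is only transformed, so words never seen in training are ignored.
X_test = vect.transform(test_texts)

# The matrix is stored in a sparse format: one row per document,
# one column per vocabulary term, and only nonzero counts are kept.
vocabulary = sorted(vect.vocabulary_)
print(vocabulary)
print(X_train.shape, X_train.nnz)
print(X_train.toarray())

# Fit a simple classifier on the training matrix and predict the test documents.
model = MultinomialNB()
model.fit(X_train, train_labels)
predictions = model.predict(X_test)
print(predictions)
```

Even in this tiny example there are more features (vocabulary terms) than observations (documents), and most entries of the matrix are zero, which is why a sparse representation is used.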

Student Handout

No handouts have been provided yet for this tutorial.