PyCon 2016 in Portland, OR

Sunday 1:20 p.m.–4:40 p.m.

Making an Impact with Python Natural Language Processing Tools

Hobson Lane, Dan Fellin, Jeremy Robin

Audience level:


Do your tweets get lost in the shuffle? Would you like to predict a tweet's impact before you hit send? Python now has all the tools to make this possible. Several Python packages for machine learning and natural language processing have reached "critical mass" and can now be combined to perform these and other powerful natural language processing tasks. This tutorial will teach you how.


### Prerequisites

Students who have experience writing `python` scripts or modules and are familiar with the `string` manipulation and formatting capabilities built into Python will have the necessary skills to complete this tutorial. Students who are also familiar with *linear algebra* and basic *statistics* concepts (like *probability* and *variance*) will be able to grasp the mathematics behind the tools assembled during the tutorial, but this is not required. Likewise, familiarity with `scikit-learn` and `pandas` will enable participants to incorporate more advanced features into their NLP pipeline. Students who are familiar with `git` and GitHub will be able to follow the logistics of the workshop sessions more quickly and spend more time developing their NLP pipeline.

### Python Development Environment

Students will need IPython, Pandas, NLTK, scipy, scikit-learn, and gensim installed on their laptops in order to run the examples in this tutorial and build the tweet impact predictor tool. Students can install these requirements in one of three ways:

1. For those with a Linux environment, the dependencies can be installed either natively or within a `virtualenv` with:

   ```
   pip install -r
   ```

2. Alternative install recipes using Anaconda will be provided.
3. A Vagrant VirtualBox customized for NLP has been packaged for those who want the power of Linux and Python within their nonfree, closed-source OS.

In addition, students have the option of installing a Python Twitter API client rather than using the preprocessed collection of Twitter feeds provided with the course material.

### Overview

Participants will develop a tweet natural language processing pipeline in three modules. The first section of the pipeline will be a natural language feature extractor and normalizer based on the Python builtins `collections`, `string`, and `re`, combined with the powerful Pandas `DataFrame` data structure.
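
As a rough illustration of what the first module could look like, the sketch below builds a small bag-of-words extractor and normalizer from `re`, `string`, and `collections.Counter`. The sample tweets, the regular expressions, and the function names are all assumptions made for this example; they are not taken from the tutorial's own material.

```python
import re
import string
from collections import Counter

# Hypothetical sample tweets, made up for illustration only.
TWEETS = [
    "Python NLP tools are AWESOME! #python #nlp http://example.com/talk",
    "@pycon Loving the tutorials, loving Portland. #python",
]

# Assumed patterns: strip URLs, keep hashtags and mentions as single tokens.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+")


def normalize(text):
    """Lowercase the text and replace URLs with a space."""
    return URL_RE.sub(" ", text.lower())


def tokenize(text):
    """Split normalized text into tokens.

    Tokens made up only of punctuation characters (for example a run of
    underscores, which the \\w class matches) are dropped.
    """
    return [
        tok for tok in TOKEN_RE.findall(normalize(text))
        if tok.strip(string.punctuation)
    ]


def extract_features(tweets):
    """Return one token-count mapping (a Counter) per tweet."""
    return [Counter(tokenize(tweet)) for tweet in tweets]


rows = extract_features(TWEETS)
```

Because each `Counter` is a dictionary, the resulting list could be handed to `pandas.DataFrame(rows).fillna(0)` to get a table with one row per tweet and one column per token, assuming pandas is installed.
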
The second section will use `scikit-learn` and `numpy` to simplify the feature set to a manageable number of features. It will find combinations of a reduced number of features that provide the most information about the subject matter of the tweets being processed.

The final section of the pipeline will compute additional features not contained in the tweet text, including time of day, day of week, number of favorites, and number of retweets. Students will use these features to compute an "impact" score and train a machine-learning model to predict the impact of proposed (not yet sent) tweets.

In the fourth and final workshop, participants will assess the performance of their machine learning pipeline, ask questions or get clarification about its performance, and optionally incorporate more advanced NLP techniques.
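
The second and third modules described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a plain `numpy` SVD in place of whichever `scikit-learn` reducer the tutorial uses, ordinary least squares in place of its model, and a made-up impact definition (the log of one plus favorites plus retweets). The tweet records, the count matrix, and all function names are hypothetical.

```python
from datetime import datetime

import numpy as np

# Hypothetical tweet metadata, made up for illustration only.
TWEETS = [
    {"created_at": "2016-05-29 13:20:00", "favorites": 12, "retweets": 5},
    {"created_at": "2016-05-30 09:05:00", "favorites": 0, "retweets": 0},
    {"created_at": "2016-05-31 18:45:00", "favorites": 3, "retweets": 1},
    {"created_at": "2016-06-01 22:10:00", "favorites": 40, "retweets": 22},
    {"created_at": "2016-06-02 07:30:00", "favorites": 1, "retweets": 0},
    {"created_at": "2016-06-03 12:00:00", "favorites": 7, "retweets": 2},
]

# Hypothetical token counts: one row per tweet, one column per token.
COUNTS = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
    [0, 1, 1, 0],
], dtype=float)


def reduce_features(counts, n_components):
    """Project centered counts onto their top principal directions (SVD)."""
    centered = counts - counts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def metadata_features(tweet):
    """Return [time of day, day of week], each scaled to the range 0..1."""
    when = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S")
    return [when.hour / 23.0, when.weekday() / 6.0]


def impact_score(tweet):
    """Assumed impact definition: log(1 + favorites + retweets)."""
    return float(np.log1p(tweet["favorites"] + tweet["retweets"]))


reduced = reduce_features(COUNTS, n_components=2)
meta = np.array([metadata_features(t) for t in TWEETS])
bias = np.ones((len(TWEETS), 1))

# Design matrix: reduced text features, metadata features, and a bias column.
X = np.hstack([reduced, meta, bias])
y = np.array([impact_score(t) for t in TWEETS])

# Ordinary least squares fit (a stand-in for a real model).
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ weights
```

Favorites and retweets only feed the target score here, since they are unknown for a tweet that has not been sent yet. The predictions are computed on the same tweets used for fitting, so they say nothing about how well the model generalizes; a realistic assessment would score the model on tweets held out from training.
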

Student Handout

No handouts have been provided yet for this tutorial.