
Wednesday 1:20 p.m.–4:40 p.m.

Machine Learning with Scikit-Learn (II)

Olivier Grisel

Audience level:


This tutorial offers an overview of common usage and methodological patterns for building predictive models with scikit-learn. In particular, we will highlight common strategies for handling data with heterogeneously typed attributes using pandas data frames, as well as model evaluation and tuning. Finally, if time permits, we will explore the specifics of working with textual data.


- Data munging for predictive modeling with pandas and scikit-learn
  Building predictive models first requires shaping the data into a format that meets the mathematical assumptions of machine learning algorithms. In this session we will introduce the pandas data frame data structure for munging heterogeneous data into a representation that is suitable for most scikit-learn models. In particular, we will address problems such as missing value imputation and the encoding of categorical variables. We will illustrate those concepts by combining pandas-based feature engineering with scikit-learn Logistic Regression, Random Forests and Gradient Boosted Trees.

- Model evaluation and selection
  Building a predictive model is a fundamentally iterative process: design a model, train it, analyze its errors, fix the model design and iterate. To iterate quickly in the right direction, it is therefore very important to understand how models fail. This session will dive into methodological concepts and scikit-learn tools for evaluating models, such as cross-validation, overfitting and underfitting, regularization, and plotting validation curves and learning curves. Finally, we will also cover how some parts of the model design can be automated via parameter search (exhaustive Grid Search or Random Search).

- Working with text data
  Machine learning with text data can be very useful for social network analytics, for instance to perform sentiment analysis. Extracting a "machine learnable" representation from raw text is an art in itself. In this session we will introduce the bag of words representation and its implementation in scikit-learn via its text vectorizers. We will discuss preprocessing with NLTK, n-gram extraction, TF-IDF weighting and the use of SciPy sparse matrices. Finally, we will use that data to train and evaluate a Naive Bayes classifier and a Linear Support Vector Machine.
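
As a rough illustration of the first two sessions, the sketch below imputes a missing numeric value, one-hot encodes a categorical column of a pandas data frame, and then evaluates and tunes a Logistic Regression model with cross-validation and grid search. It is not taken from the tutorial material: the toy data, the column names `age` and `city`, and the grid of `C` values are all assumptions made for the example, and it relies on `ColumnTransformer`, which is only available in recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column, one categorical column
# and a binary target derived from both (plus noise).
rng = np.random.RandomState(0)
n_samples = 200
age = rng.normal(40, 10, size=n_samples)
city = rng.choice(["paris", "london", "tokyo"], size=n_samples)
noise = rng.normal(0, 5, size=n_samples)
target = (age + 5 * (city == "paris") + noise > 42).astype(int)

# Inject missing values into the numeric column after computing the target.
age[rng.rand(n_samples) < 0.1] = np.nan
df = pd.DataFrame({"age": age, "city": city})

# Impute and scale the numeric column, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# 5-fold cross-validation: preprocessing is refit on each training fold.
scores = cross_val_score(model, df, target, cv=5)
print("mean CV accuracy: %.3f" % scores.mean())

# Exhaustive grid search over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(df, target)
print("best C:", search.best_params_["logisticregression__C"])
```

Keeping the imputation and encoding steps inside the pipeline means they are fitted only on the training portion of each fold, which avoids leaking information from the validation data into the preprocessing.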
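
The text session can be sketched in a similar spirit. The example below builds a bag of words representation with unigrams and bigrams, applies TF-IDF weighting, and cross-validates a Naive Bayes classifier and a Linear Support Vector Machine. The tiny sentiment corpus and its labels are invented for the illustration, and NLTK preprocessing is left out to keep the sketch short.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = positive sentiment, 0 = negative sentiment.
docs = [
    "I love this movie", "what a great film",
    "really enjoyable and fun", "a wonderful great story",
    "I hate this movie", "what a terrible film",
    "really boring and bad", "an awful terrible story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag of words with unigrams and bigrams, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print("sparse:", sparse.issparse(X), "shape:", X.shape)

# Compare the two classifiers with 4-fold cross-validation. The vectorizer
# is part of the pipeline so the vocabulary is learned per training fold.
results = {}
for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(model, docs, labels, cv=4)
    results[type(clf).__name__] = scores.mean()
    print(type(clf).__name__, "mean CV accuracy: %.2f" % scores.mean())
```

With a corpus this small the accuracy figures carry no meaning; the point is only the shape of the workflow, in which the vectorizer returns a SciPy sparse matrix that the linear models consume directly.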

Student Handout

No handouts have been provided yet for this tutorial
