Thursday 1:20 p.m.–4:40 p.m.

Advanced Machine Learning with scikit-learn

Olivier Grisel

Audience level:
Big Data


This tutorial offers an in-depth look at methods and tools for the machine learning practitioner through a selection of advanced features of scikit-learn and related projects. It targets developers who are already familiar with machine learning concepts and scikit-learn and who wish to learn how to apply those tools to larger datasets using multicore machines or distributed clusters.


Scikit-learn is an actively developed Python package that provides implementations of many of the most popular and powerful machine learning methods in use today.

Its popularity was recently underscored by its use among top contestants in machine learning challenges hosted by Kaggle.

The goal of this tutorial is to share some recipes to fully leverage the library for predictive modeling. In particular we will cover the following points:

  • How to extract features from unstructured inputs such as text documents,
  • How to do model evaluation with Cross Validation,
  • How to do model selection with Grid Search,
  • How to analyze the type of errors made by a model (bias vs variance) and the common remedies,
  • How to efficiently leverage multicore architectures without running out of memory, using joblib and numpy memmapping,
  • How to run distributed machine learning algorithms on a cheap transient Amazon EC2 cluster using IPython parallel and StarCluster,
  • How to build ensemble models.
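
As a rough illustration of the first three points (text feature extraction, cross-validation and grid search), here is a minimal sketch. It is not taken from the tutorial material: the toy corpus, the labels and the parameter grid are invented for the example, and the imports assume a recent scikit-learn release in which these helpers live in `sklearn.model_selection`, which is newer than the 0.13 release the tutorial was written for.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (invented for illustration): 1 = sports, 0 = cooking
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players",
    "fans cheered at the stadium",
    "the match ended in a draw",
    "the goalkeeper saved a penalty",
    "simmer the sauce with garlic",
    "bake the bread in a hot oven",
    "chop the onions and carrots",
    "season the soup with salt",
    "whisk the eggs with sugar",
    "roast the chicken with herbs",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Chain text feature extraction and a linear classifier so that the
# vectorizer is re-fitted on the training part of each fold only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Model evaluation: 3-fold cross-validation, one accuracy score per fold.
scores = cross_val_score(pipeline, docs, labels, cv=3)
print("fold accuracies:", scores)

# Model selection: try every combination in a small, arbitrary grid.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

Wrapping the vectorizer and the classifier in a single `Pipeline` keeps the feature extraction inside the cross-validation loop, so vocabulary statistics from held-out documents do not leak into training.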


  • scikit-learn 0.13 or later, with its dependencies: in particular numpy, scipy and matplotlib
  • IPython 0.13 or later
  • psutil 0.6.1+
  • Optional: Amazon EC2 credentials and StarCluster 0.93+

Update: See updated tutorial preparation instructions at Advanced Machine Learning with scikit-learn