
Advanced Machine Learning with scikit-learn

These are the setup instructions to prepare for the tutorial Advanced Machine Learning with scikit-learn.

Dependencies

We will use Python 2.7, as support for Python 3 is not yet complete (work is in progress). Python 2.6 should also mostly work for the tutorial.

We will need the following packages:

  • numpy >= 1.3
  • scipy >= 0.7
  • matplotlib (the latest stable release will work; reasonably recent older versions probably will too)
  • scikit-learn >= 0.13 (or current master branch from github): installation instructions
  • IPython >= 0.13.1 (or current master branch from github): installation instructions
  • psutil >= 0.6.1
  • (optionally: StarCluster current develop branch from github and Amazon EC2 credentials)
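The minimum versions listed above can be checked programmatically. The following is a minimal sketch, not an official installer check: the helper names (version_tuple, is_recent_enough, check_dependencies) are made up for illustration, and the version parsing is deliberately naive (it only keeps the leading integers of each dotted component, so development suffixes are ignored). matplotlib is left out because no minimum version is given for it.

```python
import importlib
import re

# Minimum versions copied from the dependency list above
# (keys are the importable module names, e.g. "sklearn" for scikit-learn).
MINIMUM_VERSIONS = {
    "numpy": "1.3",
    "scipy": "0.7",
    "sklearn": "0.13",
    "IPython": "0.13.1",
    "psutil": "0.6.1",
}


def version_tuple(version):
    """Turn '0.13.1' or '0.14.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def is_recent_enough(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimum_versions=MINIMUM_VERSIONS):
    """Return a dict mapping each module name to a short status string."""
    report = {}
    for name, minimum in minimum_versions.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        installed = getattr(module, "__version__", None)
        if installed is None:
            report[name] = "installed, version unknown"
        elif is_recent_enough(installed, minimum):
            report[name] = "ok (%s)" % installed
        else:
            report[name] = "too old (%s < %s)" % (installed, minimum)
    return report


if __name__ == "__main__":
    for name, status in sorted(check_dependencies().items()):
        print("%s: %s" % (name, status))
```

Any package reported as "missing" or "too old" should be installed or upgraded before the session.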

Under Windows, the easiest way to install recent binary packages for all of this is probably to get them from Christoph Gohlke's Python Package binary archive.

Be careful to download the 32 bit versions if you have the 32 bit version of Python, or the 64 bit versions otherwise. We won't need more than 2GB of RAM, so both versions should work for the tutorial.
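If you are not sure which version of Python you have, the interpreter can tell you. This is a small sketch using only the standard library; the function name python_bitness is just an illustrative choice. It reports the pointer size of the running interpreter, which is what the binary packages must match (this can differ from the bitness of the operating system).

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


print("Python %s, %d bit" % (sys.version.split()[0], python_bitness()))
```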

Check your installation

Launch a new IPython notebook session by typing the following in a console (without the $ prompt):

$ ipython notebook

The web browser should open a new window or tab for the IPython user interface: click the "New Notebook" button, then try to import all the modules by typing:

In [1]: import numpy
In [2]: import scipy
In [3]: import pylab
In [4]: import sklearn
In [5]: import IPython.parallel
In [6]: import psutil

If you get any error message, please send me an email at olivier.grisel@ensta.org with [PyCon 2013 Tutorial] in the subject and:

  • the name and version of your operating system (e.g. Windows 7, Ubuntu 12.04, OSX 10.8)
  • the versions of all the aforementioned packages you installed
  • how you installed those packages (e.g. using pip or some binary packages)
  • if under Windows: do you use 32 bit or 64 bit Python?
  • the complete traceback of the error
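Most of the information requested above can be gathered with a short script and pasted into the email. This is only a sketch: try_import and build_report are hypothetical helper names, and you still need to add by hand how the packages were installed. Each import is attempted separately so that one failure does not hide the others, and the full traceback of each failure is kept.

```python
import importlib
import platform
import struct
import traceback

# The same modules as in the notebook check above.
MODULES = ["numpy", "scipy", "pylab", "sklearn", "IPython.parallel", "psutil"]


def try_import(name):
    """Return (version, None) on success or (None, traceback text) on failure."""
    try:
        module = importlib.import_module(name)
    except Exception:
        return None, traceback.format_exc()
    return getattr(module, "__version__", "unknown"), None


def build_report(module_names=MODULES):
    """Build a plain-text report: OS, Python version and bitness, module status."""
    lines = [
        "OS: " + platform.platform(),
        "Python: %s (%d bit)"
        % (platform.python_version(), struct.calcsize("P") * 8),
    ]
    for name in module_names:
        version, error = try_import(name)
        if error is None:
            lines.append("%s: %s" % (name, version))
        else:
            lines.append("%s: FAILED\n%s" % (name, error))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_report())
```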

Tutorial Material

Updated: download the dataset archive: datasets.zip (~100MB)

Updated: download the tutorial material archive from github: parallel_ml_tutorial-master.zip and unzip it.

Or:

git clone https://github.com/ogrisel/parallel_ml_tutorial.git

You can then put datasets.zip inside the parallel_ml_tutorial folder and run:

python fetch_data.py

from there so as to unzip the datasets and make the data files ready.

There will also be a set of USB keys with the material available during the tutorial itself but it's faster to download it before the session.

You can also have a look at the README of the parallel_ml_tutorial repo on github.

Refresh your NumPy and scikit-learn

scikit-learn uses the numpy array data structure extensively. If you are not familiar with it, you should have a look at the first chapters of this tutorial. You should also get familiar with the scipy sparse data structures such as CSR and COO matrices.
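As a quick refresher, here is a minimal sketch of the three data structures mentioned above, built from the same toy data. The array values and the variable names (dense, coo, csr, weights, scores) are invented for illustration; rows play the role of samples and columns the role of features.

```python
import numpy as np
from scipy import sparse

# A small dense array: 3 samples (rows) by 4 features (columns), mostly zeros.
dense = np.array([[0., 0., 3., 0.],
                  [1., 0., 0., 0.],
                  [0., 2., 0., 0.]])

# COO stores one (row, column, value) triple per non-zero entry,
# which makes it convenient for building a sparse matrix.
rows, cols = np.nonzero(dense)
coo = sparse.coo_matrix((dense[rows, cols], (rows, cols)), shape=dense.shape)

# CSR stores the same values row by row; indptr marks where each row
# starts in the indices and data arrays, which makes row slicing and
# matrix-vector products efficient.
csr = coo.tocsr()

# Multiply every sample by a weight vector, as a linear model would.
weights = np.array([1., 10., 100., 1000.])
scores = csr.dot(weights)

print(csr.indptr, csr.indices, csr.data)
print(scores)
```

Only the 3 non-zero values are stored in the sparse formats, instead of all 12 entries of the dense array.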

This tutorial targets people with prior experience with scikit-learn. If you are new to scikit-learn and have not registered for Jake's introductory tutorial at PyCon, you are strongly advised to follow the tutorials from the official documentation or from the SciPy Lecture Notes.