Tutorial files

Please download the following archive with tutorial material and exercises:

sklearn-tutorial.tar.gz

In order to follow the tutorial on your laptop, you should install the sklearn-tutorial branch of my scikit-learn repo.

It contains new features, in particular a much improved and simplified API for text processing. Here are some instructions to get you up and running.

Dependencies

To build and run this branch you will need:

  • python 2.6 or 2.7 (3.2 is not yet fully supported...)
  • numpy (1.5+)
  • scipy (0.8+)
  • matplotlib
  • nose (provides the nosetests command)

and optionally:

  • ipython (0.12 recommended)
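
If you are not sure what is already installed, a small script along the following lines can report it. This is only a sketch: the minimum versions are copied from the list above, the helper names parse_version and check are mine, and it assumes the packages import as numpy, scipy, matplotlib, nose and IPython.

```python
import sys

# Minimum versions copied from the dependency list above; None means "any".
REQUIRED = [("numpy", (1, 5)), ("scipy", (0, 8)),
            ("matplotlib", None), ("nose", None)]
OPTIONAL = [("IPython", None)]


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for chunk in text.split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(name, minimum=None):
    """Try to import a package and return (ok, message)."""
    try:
        module = __import__(name)
    except ImportError:
        return False, "%s: not installed" % name
    version = getattr(module, "__version__", "unknown")
    if minimum is not None and version != "unknown":
        if parse_version(version) < minimum:
            return False, "%s: %s is older than required" % (name, version)
    return True, "%s: %s" % (name, version)


if __name__ == "__main__":
    print("python: %s" % sys.version.split()[0])
    for name, minimum in REQUIRED + OPTIONAL:
        print(check(name, minimum)[1])
```

Save it under any file name you like and run it with the same python you plan to use for the tutorial. A missing optional package is not a problem.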

numpy and scipy can be tricky to build from source (you will need a Fortran compiler). It might be easier to use one of the following options:

Ubuntu / Debian

On Linux Ubuntu / Debian most of this will be fetched by running:

sudo apt-get build-dep python-scikits-learn

or

sudo apt-get build-dep python-sklearn

MacOSX

On MacOSX you can use ScipySuperpack (needs python 2.7) or EPD Free.

Windows

On Windows you can use: Python XY or EPD Free.

Building scikit-learn

Note: you can do the following in a virtualenv if you prefer (recommended but not necessary).

Fetch the source from my branch using git:

git clone https://github.com/ogrisel/scikit-learn.git
cd scikit-learn
git fetch origin sklearn-tutorial
git checkout -b sklearn-tutorial origin/sklearn-tutorial

Alternatively you can download and unzip the following zip archive.

Under Linux / OSX you should be able to build by running the following commands at the top of the source folder:

make inplace
pip install -e .

Then run the tests with (you will need nosetests):

make test

All tests should pass. You can ignore the warnings.

On Windows, I prepared a win32 installer (recommended) and a win32 egg (if you prefer to use distribute). I built both of them under win7 with the mingw32 compiler.

You can check that the install went well by launching python and trying:

>>> from sklearn.svm import SVC
>>> SVC().fit([[0], [1]], [0, 1])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=1.0,
 kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)

Alternatively, if you want to build it yourself, please follow these build instructions.

To run the tests, type:

python setup.py build_ext -i
nosetests sklearn

Troubleshooting

If you have issues with the installation, please send me an email at olivier.grisel@ensta.org with detailed information about your platform and any error message you get.

If you really cannot get my dev branch to build, then fall back to the latest stable release.

For Windows users there is also an unofficial build for win64 here.

If you get errors that look like the following when running the tests:

======================================================================
ERROR: Doctest: sklearn.datasets.base.load_sample_image
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose-1.1.2-py2.7.egg\nose\plugins\doctests.py", line 395, in tearDown
    delattr(builtin_mod, self._result_var)
AttributeError: _

you can safely ignore them: this is a bug in the test runner rather than in scikit-learn itself.

You can also ignore the following test failure if you run python -c "import sklearn; sklearn.test()" after having installed the Windows package:

AttributeError("'module' object has no attribute 'semi_supervised'",) != None

This is a packaging issue for a new module that won't be used during the tutorial.

Refresh your numpy

scikit-learn uses the numpy array data structure extensively. If you are not familiar with it, you should have a look at the first chapters of this tutorial. You should also get familiar with the scipy sparse data structures such as CSR and COO matrices.
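
As a quick refresher, here is a minimal sketch of the same small matrix stored as a dense numpy array, as a COO matrix and as a CSR matrix. The values are made up for illustration, and it only assumes that numpy and scipy are installed.

```python
import numpy as np
from scipy import sparse

# A small dense array, mostly zeros (made-up values).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 0.0, 0.0]])

# COO ("coordinate") format stores one (row, col, value) triple per
# non-zero entry; it is convenient for building a sparse matrix.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]
coo = sparse.coo_matrix((values, (rows, cols)), shape=dense.shape)

# CSR ("compressed sparse row") format is better suited to arithmetic
# and row slicing; converting from COO is a single call.
csr = coo.tocsr()

print(dense.shape)    # dimensions of the dense array
print(csr.nnz)        # number of stored non-zero entries
print(csr.indptr)     # row i spans csr.data[indptr[i]:indptr[i + 1]]
print(csr.toarray())  # convert back to a dense numpy array
```

Only the non-zero entries are stored in the sparse formats, which is what makes them practical for the large, mostly empty matrices that text feature extraction produces.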

Recently modified by ogrisel: March 8, 2012, 12:54 a.m.