
Wednesday 9 a.m.–12:20 p.m.

A beginner's introduction to Pydata: how to build a minimal recommendation engine.

Diego Maniloff, Amr Hiram, Zach Howard

Audience level:
Big Data


In this tutorial we'll set ourselves the goal of building a minimal recommendation engine and, in the process, learn about Python's PyData stack and related projects: NumPy, pandas, and PyTables.

A recommendation engine is a software system that analyzes large amounts of transactional data and distills personal profiles to present its users with relevant products/information/content.


Environment setup

Checklist of the software required before the tutorial. Please make sure you can invoke a Python shell on your system and that pip install <package> works correctly for an arbitrary package.

  1. Python >= 2.6
  2. Virtualenv
  3. Pip
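
One way to verify the checklist above from inside the interpreter is sketched below. The function name check_environment is hypothetical, and the sketch assumes a Python 3 interpreter (importlib.util is not available on Python 2.6); it only reports what it finds and installs nothing.

```python
import importlib.util
import sys


def check_environment(min_version=(2, 6)):
    """Report whether the basic tutorial prerequisites appear to be met."""
    # Old-style virtualenv sets sys.real_prefix; venv and newer virtualenv
    # make sys.prefix differ from sys.base_prefix instead.
    in_virtualenv = (
        getattr(sys, "real_prefix", None) is not None
        or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    )
    return {
        "python_ok": sys.version_info[:2] >= min_version,
        "pip_ok": importlib.util.find_spec("pip") is not None,
        "in_virtualenv": in_virtualenv,
    }


if __name__ == "__main__":
    for name, ok in sorted(check_environment().items()):
        print("%-14s %s" % (name, "yes" if ok else "no"))
```

If pip_ok comes back as "no", pip is not importable from the interpreter you are running, which usually means the wrong Python is first on your PATH.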

Update: See updated tutorial preparation instructions at A beginner's introduction to Pydata: how to build a minimal recommendation engine

The recommendation problem

Estimated duration: 10'

  1. Definition of a recommender system
  2. Problem statement

Different types of recommender systems

Estimated duration: 15'

  1. Content-based recommenders
  2. Collaborative filters
  3. Hybrid solutions

Our goal: a minimal content-based recommendation engine

Estimated duration: 15'

  1. Problem domain for our example: recommending grocery items
  2. Sample dataset
  3. Flow-chart of the intended system
  4. Write pseudo-code for the chosen recommendation strategy
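
As a rough illustration of what such pseudo-code could turn into, here is a minimal content-based sketch in plain Python. The catalog, its tags, and the choice of Jaccard similarity are all assumptions made for this example; they are not the tutorial's dataset or its chosen strategy.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tags."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0


def recommend(purchased, catalog, top_n=2):
    """Rank items the user has not bought by similarity to their profile."""
    # The user profile is the union of the tags of everything purchased.
    profile = set()
    for item in purchased:
        profile |= catalog[item]
    scores = {
        item: jaccard(profile, tags)
        for item, tags in catalog.items()
        if item not in purchased
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical grocery catalog: item name -> descriptive tags.
catalog = {
    "whole milk": {"dairy", "refrigerated"},
    "yogurt": {"dairy", "refrigerated", "snack"},
    "cheddar": {"dairy", "refrigerated"},
    "apples": {"fruit", "produce"},
    "granola": {"cereal", "snack"},
}

print(recommend(["whole milk"], catalog))  # ['cheddar', 'yogurt']
```

The sets-of-tags representation keeps the logic readable, at the cost of speed; the array-based sections that follow address that trade-off.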

A sample in-memory system: intro to NumPy

Estimated duration: 40'

  1. pip install numpy
  2. The ndarray
  3. Operations between 1-d arrays and 2-d arrays
  4. Basics on broadcasting rules
  5. Translate recommendation strategy into a simple numpy-based routine
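
The topics in this section can be previewed with a short sketch: a 2-d item-by-feature array, a 1-d user profile, and a broadcast multiplication that scores every item at once. The feature matrix, the function names, and the use of cosine similarity are assumptions for illustration only.

```python
import numpy as np

# Rows are items, columns are hypothetical features (dairy, fruit, snack).
item_features = np.array([
    [1.0, 0.0, 0.0],  # 0: milk
    [1.0, 0.0, 1.0],  # 1: yogurt
    [0.0, 1.0, 0.0],  # 2: apples
    [0.0, 1.0, 1.0],  # 3: dried fruit
])


def cosine_scores(item_features, profile):
    """Cosine similarity between each item row and a 1-d profile vector."""
    # Shapes (n_items, n_features) * (n_features,): the profile is
    # broadcast across every row before summing over the feature axis.
    dots = (item_features * profile).sum(axis=1)
    norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
    # Leave the score at 0 wherever a norm is zero.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


def recommend(item_features, purchased_idx, top_n=2):
    """Return indices of the top_n items most similar to the purchases."""
    profile = item_features[purchased_idx].mean(axis=0)
    scores = cosine_scores(item_features, profile)
    scores[purchased_idx] = -np.inf  # never recommend what was bought
    return np.argsort(-scores, kind="stable")[:top_n]


print(recommend(item_features, [0]))
```

No Python-level loop over items is needed: the broadcast product and the axis-wise sum do the per-item work inside NumPy.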

Dealing with missing data: intro to pandas

Estimated duration: 40'

  1. pip install pandas
  2. The Series and DataFrame
  3. Descriptive stats of our sample dataset
  4. TBD
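
To give a flavour of how pandas treats missing values, the sketch below builds a small, made-up ratings table in which NaN marks "not rated". The users, items, and numbers are invented for this example and are not the tutorial's sample dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are items, NaN = not rated.
ratings = pd.DataFrame(
    {
        "milk": [5.0, 4.0, np.nan],
        "yogurt": [4.0, np.nan, 2.0],
        "apples": [np.nan, 1.0, 5.0],
    },
    index=["ann", "bob", "cat"],
)

# Selecting one column gives a Series; reductions skip NaN by default.
milk = ratings["milk"]
item_means = ratings.mean()

# Count the missing entries per item.
n_missing = ratings.isna().sum()

# Fill each gap with that item's mean rating (matched by column label).
filled = ratings.fillna(item_means)

print(ratings.describe())
```

Filling with the item mean is only one possible policy; leaving the NaN values in place and relying on NaN-aware reductions is often the safer default.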

Adding a persistence layer: intro to PyTables

Estimated duration: 40'

  1. pip install tables
  2. The HDF5 format
  3. Caching intermediate results
  4. TBD
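
Caching an intermediate result in an HDF5 file might look roughly like the sketch below. The function name cached_similarity, the node name "similarity", and the choice of an item-item dot-product matrix as the cached result are assumptions for illustration; the sketch assumes PyTables 3.x, where open_file and create_array are the current spellings.

```python
import os
import tempfile

import numpy as np
import tables


def cached_similarity(path, item_features):
    """Return the item-item similarity matrix, computing it only once.

    On the first call the matrix is computed and written to an HDF5 file
    at `path`; later calls read it back instead of recomputing.
    """
    if os.path.exists(path):
        with tables.open_file(path, mode="r") as h5:
            return h5.root.similarity.read()

    similarity = item_features.dot(item_features.T)
    with tables.open_file(path, mode="w") as h5:
        h5.create_array(h5.root, "similarity", similarity,
                        title="item-item dot products")
    return similarity


if __name__ == "__main__":
    features = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    cache_path = os.path.join(tempfile.mkdtemp(), "cache.h5")
    first = cached_similarity(cache_path, features)   # computed and stored
    second = cached_similarity(cache_path, features)  # read from disk
    print(np.array_equal(first, second))
```

Note that this sketch never invalidates the cache: if the feature matrix changes, the stale file has to be removed by hand.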

Putting it all together

Estimated duration: 20'

  1. Back to the flow-chart: filling out the implementation details
  2. Where to go next