
Powering Recommendations with Distributed Computing using Python and MapReduce

Marcel Caraciolo

Audience level:
Big Data


This poster presents how to build scalable recommender systems with the MapReduce paradigm and Python (including the packages Crab, mrjob, SciPy and NumPy). Recommender systems analyze user preference data and estimate which items are of interest to each user. They are applicable in several domains, such as search, medicine, e-commerce and social networks.


Recommender systems are one of the main topics in machine learning today. Recommendation algorithms can be employed on social networks and e-commerce sites to recommend items (such as products, news and friends) based on users' historical data and online behavior. But real-world systems may have thousands of products and users, and the correlations between them must be computed.
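To make "computing the correlation between them" concrete, here is a minimal single-machine sketch that scores every pair of items with cosine similarity. The toy ratings, the function names and the choice of cosine similarity are all assumptions made for illustration; they are not taken from Crab or from the poster.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"book": 5.0, "film": 3.0},
    "bob": {"book": 4.0, "film": 4.0, "game": 2.0},
    "carol": {"film": 5.0, "game": 5.0},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, rating in prefs.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def all_pairs_similarity(ratings):
    """Score every unordered pair of items; work grows quadratically with items."""
    vectors = item_vectors(ratings)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


similarities = all_pairs_similarity(ratings)
```

With n items this loop visits n(n-1)/2 pairs, which is why the all-pairs step becomes the bottleneck as catalogs grow.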

Processing this amount of data is challenging. Distributed computation using MapReduce is a powerful solution for problems of this type, which are characterized by large datasets and heavy computation. Several Python frameworks, such as mrjob and Disco, can be applied to this task and to machine learning problems in general, including recommender systems.

This poster describes a new extension of Crab, a framework that provides the infrastructure to develop and test recommender algorithms using the SciPy, NumPy and Matplotlib packages. The extension adds support for distributed computing using MapReduce and Python. The poster shows how you can use Python and MapReduce with our framework, and how the same approach could be used in other machine learning scenarios.

It also presents the workflow for scalable item-to-item collaborative filtering with Python, MapReduce and mrjob.
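One possible shape for such a workflow is sketched below as two MapReduce steps, simulated locally in plain Python so that it runs without a cluster or any library. The `run_step` helper, the step functions, the toy records and the use of cosine similarity over co-rated items are assumptions for illustration, not the poster's actual implementation. In mrjob, each step would typically be written as a mapper/reducer pair on a job class rather than as free functions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def run_step(records, mapper, reducer):
    """Simulate one MapReduce step locally: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Step 1: group ratings by user, then emit every pair of co-rated items.
def map_by_user(record):
    user, item, rating = record
    yield user, (item, rating)


def reduce_to_item_pairs(user, item_ratings):
    for (item_a, r_a), (item_b, r_b) in combinations(sorted(item_ratings), 2):
        yield (item_a, item_b), (r_a, r_b)


# Step 2: group rating pairs by item pair, then compute one score per pair.
def map_identity(record):
    yield record  # record is already a (key, value) tuple


def reduce_to_similarity(pair, rating_pairs):
    # Cosine similarity restricted to users who rated both items.
    dot = sum(a * b for a, b in rating_pairs)
    norm_a = sqrt(sum(a * a for a, _ in rating_pairs))
    norm_b = sqrt(sum(b * b for _, b in rating_pairs))
    yield pair, (dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)


# Hypothetical input records: (user, item, rating)
records = [
    ("alice", "book", 5.0), ("alice", "film", 3.0),
    ("bob", "book", 4.0), ("bob", "film", 4.0), ("bob", "game", 2.0),
    ("carol", "film", 5.0), ("carol", "game", 5.0),
]

item_pairs = run_step(records, map_by_user, reduce_to_item_pairs)
item_similarities = run_step(item_pairs, map_identity, reduce_to_similarity)
```

Keying the first step by user means each reducer only needs one user's ratings in memory, and keying the second step by item pair lets the pair scores be computed independently of each other, which is what makes the flow straightforward to distribute.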