Presentation: mrjob: Snakes on a Hadoop

Wednesday 9 a.m.–12:20 p.m.

Jim Blomo

Audience level: Intermediate
Category: Python Libraries

Description

This tutorial will take participants through basic usage of mrjob by writing analytics jobs over Yelp data. mrjob lets you easily write, run, and test distributed batch jobs in Python, on top of Hadoop. Hadoop is a MapReduce platform for processing big data, but it requires a fair amount of Java boilerplate. mrjob is an open-source Python library written by Yelp, where it is used to process terabytes of data every day.
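To make the MapReduce model concrete, here is a minimal sketch in plain Python (no Hadoop or mrjob required) that counts events per user. The log format, the sample lines, and the helper `run_local` are all assumptions made up for illustration; they are not taken from the tutorial materials.

```python
from collections import defaultdict

# Hypothetical log format: "<timestamp> <user_id> <action>"
SAMPLE_LOG = [
    "2013-03-13T09:00:01 alice search",
    "2013-03-13T09:00:05 bob review",
    "2013-03-13T09:01:12 alice review",
    "2013-03-13T09:02:40 alice checkin",
]


def mapper(line):
    """Emit (user_id, 1) for each well-formed log line; skip the rest."""
    fields = line.split()
    if len(fields) == 3:
        yield fields[1], 1


def reducer(user_id, counts):
    """Sum the per-line counts that were grouped under one user."""
    yield user_id, sum(counts)


def run_local(lines):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


activity = run_local(SAMPLE_LOG)
print(activity)
```

Neither function touches shared state: each mapper call sees one line and each reducer call sees one key, which is what lets a framework spread the work across machines. In mrjob, the same two functions would typically become the `mapper` and `reducer` methods of an `MRJob` subclass.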

Abstract

Yelp had a problem: how to process a mountain of logs in a distributed, fault-tolerant way. Sounds like a perfect fit for Hadoop, right? Just one problem: we love Python and wanted to use its conciseness and power, not to mention existing libraries and business logic, to analyze our data. So we wrote [mrjob][1], which tries to combine the best of the Hadoop and Python worlds. With mrjob you can easily write MapReduce jobs in native Python, which can be executed via Hadoop Streaming on in-house Hadoop clusters or services like Amazon’s [Elastic MapReduce][2].

This tutorial will start with a brief introduction to the MapReduce paradigm. What are the trade-offs this approach provides? By giving up global state and variables, MapReduce buys you scalability and fault tolerance. Then we will cover an example: counting user activity in a log. We will walk through a simple mrjob program, then ask participants to tweak the code to achieve different outputs. mrjob has the option of running locally, so setup will be simple and attendees can focus on learning the framework instead of messing with cluster configurations.

Next we will explain how to use mrjob to analyze more complex data using the [Yelp dataset][3]. We’ll introduce some data science concepts, such as user-user similarity, and show how to calculate these metrics in mrjob. Again, participants will be asked to extend the example to improve its sophistication.

Finally, we will demonstrate how to deploy mrjob to a production cluster using Amazon’s Elastic MapReduce service. Participants will walk away with the ability to deploy an mrjob-based solution in their workplace, from testing, through deployment, to monitoring.

[1]: https://github.com/Yelp/mrjob
[2]: http://aws.amazon.com/elasticmapreduce/
[3]: http://www.yelp.com/dataset_challenge
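As a rough illustration of the user-user similarity idea mentioned in the abstract, the sketch below computes Jaccard similarity between users from (user, business) review pairs, arranged as two map/reduce-style steps in plain Python. The choice of Jaccard similarity, the sample data, and every function name here are assumptions for illustration; the abstract does not say which metric or data layout the tutorial uses.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (user_id, business_id) review pairs; not real Yelp data.
REVIEWS = [
    ("alice", "cafe"), ("alice", "diner"), ("alice", "taqueria"),
    ("bob", "cafe"), ("bob", "diner"),
    ("carol", "taqueria"),
]


def group_by_key(pairs):
    """Stand-in for the shuffle phase: collect values under each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def map_by_business(review):
    """Step 1 map: re-key each review by the business it refers to."""
    user_id, business_id = review
    yield business_id, user_id


def reduce_to_user_pairs(business_id, user_ids):
    """Step 1 reduce: emit each pair of users who reviewed this business."""
    for a, b in combinations(sorted(set(user_ids)), 2):
        yield (a, b), 1


def jaccard_similarities(reviews):
    """Return {(user_a, user_b): |A & B| / |A | B|} for co-reviewing users."""
    # Step 1: group reviewers per business, then emit co-reviewing pairs.
    by_business = group_by_key(
        kv for review in reviews for kv in map_by_business(review)
    )
    pair_events = [
        kv
        for business_id, user_ids in by_business.items()
        for kv in reduce_to_user_pairs(business_id, user_ids)
    ]

    # Step 2: sum the pair events to get the number of shared businesses.
    shared = {pair: sum(ones) for pair, ones in group_by_key(pair_events).items()}

    # Side data: the set of distinct businesses each user reviewed.
    per_user = defaultdict(set)
    for user_id, business_id in reviews:
        per_user[user_id].add(business_id)

    return {
        (a, b): n / (len(per_user[a]) + len(per_user[b]) - n)
        for (a, b), n in shared.items()
    }


similarities = jaccard_similarities(REVIEWS)
for pair, score in sorted(similarities.items()):
    print(pair, round(score, 3))
```

Users with no business in common never appear as a pair, so the output stays sparse. In a real distributed job the per-user counts would need their own map/reduce step or a join rather than the in-memory dictionary used here.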

Student Handout

No handouts have been provided yet for this tutorial