Storm: the Hadoop of Realtime Stream Processing

Gabriel Grant

Type:: Talk
Audience level:: Intermediate
Category:: Big Data

March 10th 2:15 p.m. – 2:55 p.m.

Description

Twitter's new scalable, fault-tolerant, and simple(ish) stream programming system... with Python!

Abstract

Storm is a high-volume, continuous, reliable stream processing system developed at BackType and recently open-sourced by Twitter. Though most of the system (and it's documentation) is written in Java-based languages, it is possible to use in a Python environment with Python-based analysis code. At DotCloud (our application-platform-as-a-service) we're doing just that, and we'll be showing how you can too.

We collect a lot of data: we have tens of thousands of customers, many of whom have dozens of services running on our platform, each of which in turn produces dozens of metrics every second. All in all, we're dealing with millions of datapoints per minute. Storm will be the third iteration of our metrics system, an attempt at standardizing a number of previously-distinct pieces of our infrastructural software, to enable automated, real-time reactions to changes in the platform's state.

We'll start by touching on what problems Storm is (and isn't) trying to solve and why it's model is so powerful, informed by our previous attempts to solve the stream processing problem. We'll then move on to a deep dive into how to get Storm up and running with the most Python and least Java-enduced-pain possible and finish up with tips to solve some of the challenges we've encountered while adopting Storm into our Python-based development process.

Outline

The Problem:

What is stream processing?
- High volume, Continuous, Reliable data analysis
How do people solve this today?

The Solution:

Storm's overall model
Automatic parts
Topologies
ZeroMQ
Why is this solution better?

The Hard Part, Made Simple:

Build a topology
Code your processors

The Simple Part, Made Hard (made simple):

Development
Testing
Deployment
Java (ugh)
- Clojure (less ugh)