Storm: the Hadoop of Realtime Stream Processing

Audience level:
Big Data
March 10th 2:15 p.m. – 2:55 p.m.


Twitter's new scalable, fault-tolerant, and simple(ish) stream programming system... with Python!


Storm is a high-volume, continuous, reliable stream processing system developed at BackType and recently open-sourced by Twitter. Though most of the system (and its documentation) is written in JVM-based languages, it is possible to use it from a Python environment with Python-based analysis code. At DotCloud (our application-platform-as-a-service) we're doing just that, and we'll be showing how you can too.

We collect a lot of data: we have tens of thousands of customers, many of whom have dozens of services running on our platform, each of which in turn produces dozens of metrics every second. All in all, we're dealing with millions of datapoints per minute. Storm will be the third iteration of our metrics system, an attempt at standardizing a number of previously distinct pieces of our infrastructural software, to enable automated, real-time reactions to changes in the platform's state.
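A quick back-of-envelope check shows how "dozens of metrics every second" adds up to millions of datapoints per minute. The specific counts below are illustrative assumptions, not DotCloud's actual numbers:

```python
# Illustrative sketch: plausible (assumed, not actual) platform-scale numbers.
services = 20_000          # tens of thousands of customers, some with dozens of services
metrics_per_service = 12   # "dozens of metrics every second" -- assume a dozen here
seconds_per_minute = 60

datapoints_per_minute = services * metrics_per_service * seconds_per_minute
print(datapoints_per_minute)  # 14_400_000 -- millions of datapoints per minute
```

Even with conservative assumptions, a per-second metrics pipeline at this scale lands comfortably in the millions-per-minute range.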

We'll start by touching on what problems Storm is (and isn't) trying to solve and why its model is so powerful, informed by our previous attempts to solve the stream processing problem. We'll then move on to a deep dive into how to get Storm up and running with the most Python and the least Java-induced pain possible, and finish up with tips for solving some of the challenges we've encountered while adopting Storm into our Python-based development process.


The Problem:

  • What is stream processing?
  • High-volume, continuous, reliable data analysis
  • How do people solve this today?

The Solution:

  • Storm's overall model
  • Automatic parts
  • Topologies
  • ZeroMQ
  • Why is this solution better?

The Hard Part, Made Simple:

  • Build a topology
  • Code your processors

The Simple Part, Made Hard (made simple):

  • Development
  • Testing
  • Deployment
  • Java (ugh)
    • Clojure (less ugh)