top band

Sunday 1:10 p.m.–1:40 p.m.

streamparse: real-time streams with Python and Apache Storm

Andrew Montalenti

Audience level:
Python Libraries


Real-time streams are everywhere, but does Python have a good way of processing them? Until recently, there were no good options. A new open source project, streamparse, makes working with real-time data streams easy for Pythonistas. If you have ever wondered how to process 10,000 data tuples per second with Python -- while maintaining high availability and low latency -- this talk is for you.


Until recently, the only good option for real-time stream processing in Python was to build your own home-grown solution atop worker-and-queue frameworks like [rq][rq] or [celery][celery]. Though these projects are good for distributing workload across Python processes and machines, they do not have built-in mechanisms for message reliability, fault tolerance, or multi-machine cluster management. A new open source project that was developed in the last year and has recently hit a major 1.0 milestone, [streamparse][streamparse], finally makes working with real-time data streams easy for Pythonistas. If you have ever wondered how to process tens of thousands of data tuples per second with Python using long-lived processes -- while maintaining fast throughput, high availability, and low latency -- this talk will give you an overview and deep dive. ## Detailed Talk Overview ### What is Storm? [Apache Storm][storm] is a battle-tested stream processing framework that is already used in production by the likes of Twitter, Spotify, and Wikipedia. Storm has been shown to handle 1,000,000 tuples per second per node in benchmarks (reported by Nathan Marz, author of ["Big Data"][big-data] by Manning Press). It has also been shown to scale up to 1,200 nodes across a computation cluster (reported by [Twitter][twitter-storm]). In other words, it is good stuff! Before streamparse, using Storm with Python was a bit painful. Fortunately, streamparse makes using Storm easy and Pythonic, in the same way that [mrjob][mrjob] made using Hadoop easy and Pythonic. ### streamparse components streamparse has four major components: 1. A command-line tool, `sparse`, that makes creating Python projects that will work with Storm very easy. 2. A Python module, `streamparse`, that implements Storm's multi-lang protocol; we call this the IPC (inter-process communication) layer. 3. Extensions for `Fabric` that allow you to manage a remote cluster of Storm machines, complete with Python dependency management. 4. A thin Java interop layer written in Clojure and accessed with `lein` that makes it possible for you to manage a Storm cluster and compile Storm topologies from the command line; the Java bits are hidden from the streamparse user so they can work in pure Python. These will be covered in the talk. ### Real-world stream processing This talk will also provide an overview of stream processing challenges, and put this in the context of streamparse's (and Storm's) internal architecture. Attendees will be able to use this knowledge to quickly build their own Python-on-Storm topologies, for example implementing a scalable "real-time word counter topology" in Python using only a few keystrokes. The talk will conclude by showing how we currently use streamparse, Storm, and [Kafka][kafka] in production to process billions of page views per month of analytics data with sub-second latencies. [celery]: [rq]: [mrjob]: [storm]: [big-data]: [streamparse]: [twitter-storm]: [kafka]:
bottom band background