PyCon 2016 in Portland, OR

Tuesday 1:55 p.m.–2:25 p.m.

Build Serverless Realtime Data Pipelines with Python and AWS Lambda

Mercedes Coyle

Audience level:


At Scripps Networks Living, we operate a network of video players generating around 100 million events per day. To process, store, and analyze this data, we run batch and realtime data pipelines based on Lambda Architecture principles. After outgrowing our original events system, we rebuilt it from the ground up using AWS services and the lessons learned from that system.


Our previous speed layer had been successful in providing a window into the data in real time, with access to 100 million events indexed per day. The ability to easily query realtime data has helped us quickly detect changes in player health, potential stream and ad issues, and system health. However, the system was suffering greatly from its own success: the technical debt incurred as our needs outgrew it meant that engineering spent more time on maintenance and downtime than on development. Built in about a month and held together with chewing gum and duct tape, the system was due for a rebuild from the ground up.

With the announcement of AWS Lambda at re:Invent this past fall, we saw an opportunity to reduce our infrastructure footprint and take advantage of Amazon’s plethora of infrastructure services. In addition to letting us rebuild our realtime system with zero infrastructure, Lambda and Python gave us the flexibility to route, clean, and analyze our data in realtime, and to send it on to various services (Redshift, Elasticsearch, HDFS, etc.).

In this talk, I’ll cover an architecture redesign from a homegrown system using Nginx, Rsyslog, and the ELK stack to a completely managed infrastructure using Amazon’s API Gateway, Kinesis, and Lambda. I’ll also talk about performance implications for Python code running on Lambda (vs. JVM-based languages) and will walk through some example code for getting started with Python on Lambda.
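
As a starting point for Python on Lambda, here is a minimal sketch of a handler that could sit at the Kinesis-to-Lambda step of a pipeline like this one. It is an illustration, not the code from the talk: the `event_type` field, the `route` helper, and the destination names are hypothetical. The one fixed detail is the event shape: Lambda delivers Kinesis records under `Records`, with each payload base64-encoded in `kinesis.data`.

```python
import base64
import json


def decode_kinesis_records(event):
    """Yield JSON objects decoded from a Kinesis-triggered Lambda event."""
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            payload = json.loads(raw.decode("utf-8"))
        except ValueError:
            # Cleaning step: drop records that are not valid UTF-8 JSON.
            continue
        if isinstance(payload, dict):
            yield payload


def route(payload):
    """Hypothetical routing rule: errors go to search, the rest to the warehouse."""
    if payload.get("event_type") == "error":
        return "elasticsearch"
    return "redshift"


def handler(event, context):
    """Lambda entry point: group decoded events by destination and count them."""
    routed = {}
    for payload in decode_kinesis_records(event):
        routed.setdefault(route(payload), []).append(payload)
    # A real pipeline would write each group to its service here;
    # this sketch only reports how many events went where.
    return {destination: len(items) for destination, items in routed.items()}
```

The handler can be exercised locally by calling `handler(event, None)` with a hand-built event dictionary, which makes it straightforward to test the decode and routing logic before deploying.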