Change the future

Luigi - Batch Data Processing in Python

Elias Freider

Audience level:
Experienced
Category:
Big Data

Description

Luigi is Spotify's recently open sourced Python framework for batch data processing including dependency resolution and monitoring. The framework helps you to organize the execution of inter-dependent recurring tasks and has a streamlined interface for integrating with HDFS and Hadoop MapReduce.

Abstract

Background

Streaming music company Spotify has terabytes of data being generated by backend services every day. This data needs to be joined, aggregated and processed in other ways to power a wide range of applications including music recommendations, internal dashboards, financial reports and advertisement simulations. These and many other tasks need to run on a daily or even an hourly basis.

The Luigi framework

Luigi is Spotify's recently open sourced Python framework for batch data processing. It was created for managing task dependencies, automating scheduling of tasks, monitoring execution and to provide templates for common batch processing tasks. It is the result of years of experience of dealing with big data at Spotify. At the core, Luigi is a Python dependency resolver and file system abstraction layer conceptually similar to GNU Make or Apache Oozie, but the focus of the project has been to simplify distributed computing using Python and Apache Hadoop. The framework itself is generic enough to be used for everything from simple task execution and monitoring on a local work station, to launching huge chains of processing tasks that can run in synchronization between many machines over the span of several days.

Luigi comes with file system abstraction for the Hadoop Distributed File System (HDFS) and an integrated interface to Hadoop Streaming MapReduce. You can easily create extensions for additional file systems and data processing systems. Powerful extensions for working with relational databases and NoSQL systems like Cassandra are currently in development.

This poster will explain the architecture of Luigi and demonstrate how its basic data structures can be combined to build parametrized tasks and execution trees. It's aimed at Python developers that want to run multi-step batch computations, regularly perform bulk transfers of data between systems or just want to get started with Hadoop and big data.