PyCon Pittsburgh. April 15-23, 2020.

Talk: Life Beyond Yaml: Bridging Data Science and Data Infrastructure with Apache Airflow

Presented by:

Daniel Imberman

Description

When supporting a data science team, data engineers are tasked with building a platform that keeps a wide range of stakeholders happy. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. Collaboration between these stakeholders can be difficult, as every data science pipeline has a unique set of constraints and system requirements (compute resources, network connectivity, etc.). For these reasons, data engineers strive to give their data scientists as much flexibility as possible while maintaining an observable and resilient infrastructure.

In recent years, Apache Airflow (a Python-based task orchestrator developed at Airbnb) has gained popularity as a collaborative platform between data-centric Pythonistas and infrastructure engineers looking to spare their users from verbose and rigid YAML files. Apache Airflow exposes a flexible, Pythonic interface that serves as a collaboration point between data engineers and data scientists: data engineers can build custom operators that abstract details of the underlying system, and data scientists can use those operators (and many more) to build a diverse range of data pipelines.
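To make the custom-operator idea concrete, the following is a minimal sketch (not code from the talk) of what that collaboration point can look like, assuming a recent Airflow release; the SparkFeatureExtractOperator, DAG name, and paths are illustrative placeholders.

from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator


class SparkFeatureExtractOperator(BaseOperator):
    """Hypothetical custom operator: the data engineer hides the Spark
    submission details so the data scientist only supplies paths."""

    def __init__(self, input_path, output_path, **kwargs):
        super().__init__(**kwargs)
        self.input_path = input_path
        self.output_path = output_path

    def execute(self, context):
        # A real implementation would submit a Spark job here (e.g. via a hook);
        # the data scientist never touches cluster configuration.
        self.log.info("Extracting features from %s into %s",
                      self.input_path, self.output_path)


with DAG(dag_id="feature_pipeline",
         start_date=datetime(2020, 4, 1),
         schedule_interval="@daily") as dag:

    extract = SparkFeatureExtractOperator(
        task_id="extract_features",
        input_path="/data/raw/events",
        output_path="/data/features/events",
    )

The data scientist instantiates the operator with a task_id and paths; everything about how the Spark job actually runs stays behind the operator's execute method, which the data engineering team owns.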

In this 30-minute talk, we will take an idea from a single-machine Jupyter Notebook, to a cross-service Spark + TensorFlow pipeline, to a canary-tested, hyperparameter-tuned, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
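As a rough sketch of how such a multi-stage pipeline might be wired together in Airflow (again assuming 2.x-style imports; the DAG name, task names, and callables are hypothetical stand-ins for the Spark, TensorFlow, tuning, canary, and Cloud Functions steps, not the speaker's actual code):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables standing in for the real stages; in practice each
# would call out to Spark, TensorFlow, a tuning service, or Google Cloud APIs.
def prepare_features():
    pass

def train_model():
    pass

def tune_hyperparameters():
    pass

def canary_test():
    pass

def deploy_to_cloud_functions():
    pass


with DAG(dag_id="notebook_to_production",
         start_date=datetime(2020, 4, 1),
         schedule_interval=None) as dag:

    prepare = PythonOperator(task_id="prepare_features",
                             python_callable=prepare_features)
    train = PythonOperator(task_id="train_model",
                           python_callable=train_model)
    tune = PythonOperator(task_id="tune_hyperparameters",
                          python_callable=tune_hyperparameters)
    canary = PythonOperator(task_id="canary_test",
                            python_callable=canary_test)
    deploy = PythonOperator(task_id="deploy_model",
                            python_callable=deploy_to_cloud_functions)

    # Airflow's bitshift syntax declares the ordering of the pipeline stages.
    prepare >> train >> tune >> canary >> deploy

Each placeholder task could later be swapped for a purpose-built operator without changing the overall shape of the DAG, which is what lets the notebook prototype grow into a production pipeline one stage at a time.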