PyCon Pittsburgh. April 15-23, 2020.

Talk: Deploying Python at Scale with Dask

Presented by:

Matthew Rocklin

Description

This talk discusses the challenges and options to scale Python with Dask on distributed hardware. We particularly focus on how Dask gets deployed on cluster resource managers like Kubernetes, Yarn, and the cloud today.

The Python data science stack (Numpy, Pandas, Scikit-Learn and others) has become the gold standard in most data centric fields due to a combination of intuitive high level APIs and efficient low-level code. However, these libraries were not originally designed to scale beyond a single CPU or data that fits in memory. Over the last few years the Dask library has worked with these libraries to provide scalable variants, which do run on multi-core workstations or on distributed clusters. This has allowed advanced users the ability to scale Python to handle 100+TB datasets.

However, deploying Dask within an institution remains a challenge. How do we balance load across many machines? How do we share with other distributed systems running on those same machines? How do we control access and provide authentication and security? As more institutions adopt Python to handle scalable computation these questions arise with greater urgency. This talk discusses the options today to deploy Dask securly within an institution on distributed hardware, and dives into some examples where this has had a large positive social impact.