PyCon Pittsburgh. April 15-23, 2020.

Tutorial: Scalable Computing with Dask

Presented by:

Tom Augspurger, Matthew Rocklin

Description

Python is a great language for data analysis. We have a rich ecosystem of libraries like pandas, NumPy, and scikit-learn with well-designed APIs and good performance. However, these libraries are mostly limited to datasets that fit in memory, and they often use just a single CPU core on a single machine.

We’ll introduce Dask, a library for parallel and distributed computing that scales Python code from a single machine to a cluster. Every attendee will be given their own Dask cluster to analyze larger-than-memory datasets using familiar APIs and tools. We’ll see how Dask works with libraries like NumPy, pandas, and scikit-learn to scale out to larger problems.
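
As a rough illustration of what "familiar APIs" means here, the following minimal sketch uses dask.array with NumPy-style syntax. It is not taken from the tutorial materials: the array shape, the chunk size, and the expression being computed are assumptions chosen for illustration, and it runs on Dask's default local scheduler rather than on a cluster.

```python
import dask.array as da

# Build a lazy 4000 x 4000 array of ones, split into 1000 x 1000 chunks.
# (Shape and chunk size are illustrative assumptions, not tutorial values.)
x = da.ones((4000, 4000), chunks=(1000, 1000))

# NumPy-style expression. This only builds a task graph; nothing is computed yet.
total_lazy = (x + x.T).sum()

# compute() executes the graph, processing the chunks in parallel.
total = total_lazy.compute()

print(x.numblocks)  # number of chunks along each axis
print(total)
```

Because each chunk is processed separately, the full array never has to be held in memory at once, which is the idea that lets the same style of code apply to larger-than-memory data.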