Wednesday 1:20 p.m.–4:40 p.m. in Room 20

Parallel Data Analysis with Dask

Tom Augspurger, James Crist, Martin Durant

Description

The libraries that power data analysis in Python are essentially limited to a single CPU core and to datasets that fit in RAM. Dask is a parallel computing framework with a focus on analytical computing. Attendees will see how Dask can parallelize their workflows while they still write what looks like normal Python, NumPy, or pandas code.

We'll start with `dask.delayed`, which helps parallelize your existing Python code. We'll demonstrate it on a small example, introducing the concepts at the heart of Dask: the *task graph* and the *schedulers* that execute tasks. We'll compare this approach to the simpler but less flexible parallelization tools available in the standard library, such as `concurrent.futures`.

Attendees will then see the high-level collections Dask provides for writing regular Python, NumPy, or pandas code that is executed in parallel on datasets that may be larger than memory. These collections provide a familiar API, but the execution model is very different. We'll discuss the GIL, serialization, and other headaches that come up in parallel programming, and we'll use Dask's various schedulers to illustrate the differences between multi-threaded, multi-process, and distributed computing. Dask includes a distributed scheduler for executing task graphs on a cluster of machines, and we'll provide each attendee with access to their own cluster.
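To give a flavor of the `dask.delayed` workflow described above, here is a minimal sketch. It is not taken from the tutorial materials: the helper functions `inc` and `add` and the input values are made up for illustration, and the sketch assumes Dask is installed. It builds a three-task graph lazily and then runs it under two different local schedulers.

```python
from dask import delayed


def inc(x):
    # Hypothetical stand-in for an expensive, independent computation.
    return x + 1


def add(x, y):
    # Hypothetical stand-in for a step that combines earlier results.
    return x + y


# Wrapping a function in delayed() records the call as a task
# instead of running it immediately.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)  # depends on a and b; nothing has run yet

# compute() hands the task graph to a scheduler, which executes it.
# The two inc tasks are independent, so a parallel scheduler may run
# them concurrently.
result_threads = total.compute(scheduler="threads")

# The single-threaded scheduler runs the same graph serially, which is
# handy for debugging.
result_sync = total.compute(scheduler="synchronous")

print(result_threads, result_sync)
```

The same graph can be handed to a different scheduler without changing the code that built it, which is the separation between task graph and scheduler that the description refers to.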

Student Handout

No handouts have been provided yet for this tutorial.