Friday 4:15 p.m.–5 p.m.
Dask: A Pythonic Distributed Data Science Framework
Dask is a general purpose parallel computing system capable of Celery-like task scheduling, Spark-like big data computing, and Numpy/Pandas/Scikit-learn level complex algorithms, written in Pure Python. Dask has been adopted by the PyData community as a Big Data solution. This talk focuses on the distributed task scheduler that powers Dask when running on a cluster. We'll focus on how we built a Big Data computing system using the Python networking stack (Tornado/AsyncIO) in service of its data science stack (NumPy/Pandas/Scikit Learn). Additionally we'll talk about the challenges of effective task scheduling in a data science context (data locality, resilience, load balancing) and how we manage this dynamically with aggressive measurement and dynamic scheduling heuristics.