Saturday 10:50 a.m.–11:20 a.m. in Room 26A/B/C

Democratizing Distributed Computing with Dask and JupyterHub

Matthew Rocklin


We use JupyterHub, XArray, Dask, and Kubernetes to build a cloud-based system to enable scientists to analyze and manage large datasets. We use this in practice to serve a broad community of atmospheric and climate scientists. Atmospheric and climate scientists analyze large volumes of observational and simulated data to better understand our planet. They have historically used tools like NumPy and SciPy along with Jupyter notebooks to combine efficient computation with accessibility. However, as datasets increase in size and collaboration extends to new populations of scientists these tools begin to feel their age. In this talk we use more recent libraries to build a modern deployment for academic scientists. In particular we use the following tools: - **Dask:** to parallelize and scale NumPy computations - **XArray**: as a self-discribing data model and tool kit for labeled and index arrays - **JupyterLab:** to enable more APIs for users beyond the classic notebook - **JupyterHub:** to manage users and maintain environments for a new population of cloud-friendly users - **Kubernetes:** to manage everything and deploy easily on cloud hardware This talk will focus less on how these libraries work and will instead be a case study of using them together in an operational setting. During the talk we will build up and deploy a running system that the audience can then use to access distributed computing resources.