Talks

Processing Large Geospatial Datasets with Dask & Xarray

Friday, May 16th, 2025 12:30 p.m.–1 p.m. in Ballroom A

Presented by

Patrick Hoefler

Experience Level:

Advance experience

Description

Geospatial datasets are growing in size, often exceeding 100TB and reaching into Petabyte scale. Many of these datasets are publicly available, providing a great resource for analysis, but working with them requires increasingly large computational resources and a diverse set of tools.

We will start by briefly introducing Dask and Xarray, which form the backbone of the geospatial stack in Python. Using the ERA5 dataset as a case study, we will demonstrate how Xarray can be used to explore large-scale climate data effectively from your local laptop.

Building on this foundation, we will delve into recent advancements in Dask Array. Originally designed as a parallel NumPy API, Dask Array was used to handle much larger datasets over the last few years. We’ll explore the latest developments in Dask and Xarray that continue to expand the scalability and capabilities of these tools to catch up with the scale requirements of modern datasets.

This discussion will highlight improvements in ease of use, scalability, and performance. Additionally, we’ll present the first-ever set of geospatial benchmarks, collected earlier in 2024 from the community. These benchmarks provide a clear illustration of the scale at which Xarray and Dask are required to operate.

Finally, we’ll offer a peak behind the scenes of an ongoing project aimed at building the first ever query optimizer for large scale array computations.

Search