Tutorials: Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets

Wednesday - April 19th, 2023 9 a.m.-12:30 p.m. in 250F

Presented by:


Experience Level:

Some experience

Description

While most folks aren't at the scale of cloud giants or black hole research teams that analyze petabytes of data every day, you can easily find yourself in a situation where your laptop doesn't have quite enough power for the analytics you need.

"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitude larger than what can fit into a typical laptop's memory.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets through real-world examples run on powerful machines in a public cloud, from how the data is stored and read to how it is processed and visualized.

You will understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data. By the end, you will be able to answer:

What makes some data formats more efficient at scale?

Why, how, and when (and when not) to leverage parallel and distributed computation (primarily with Dask) for your work?

How to manage cloud storage, resources, and costs effectively?

How interactive visualization can make large and complex data more understandable (primarily with hvPlot)?

How to comfortably collaborate on data science projects with your entire team?

The tutorial focuses on the reasoning, intuition, and best practices around big data workflows, while covering the practical details of Python libraries like Dask and hvPlot that are well suited to large data. It includes plenty of exercises to help you build a foundational understanding within three hours.