Talks: Transforming a Jupyter Notebook into a reproducible pipeline for ML experiments

Friday - April 21st, 2023 3:15 p.m.-3:45 p.m. in 255DEF

Presented by:


Experience Level:

Some experience

Description

Jupyter Notebooks are part of every data scientist's arsenal and for good reason. But while they're great for prototyping in data science projects, they are not ideal for experimenting with different configurations. I have been guilty of running experiments with changing parameters while keeping track on a notepad, and the result has always been messy.

In this session, we will explore how we can transform our notebook prototype into a reproducible pipeline. We will discuss what goes wrong without proper experiment tracking, why reproducibility is the key to solving this, and how we can achieve that with Git and DVC.

I will discuss this topic using a text2image project with Stable Diffusion. I'll show how to break up a notebook into modules, create a pipeline from them, run experiments through the pipeline, and compare their results to find the best possible outcomes.

The target audience will be data scientists that don't have a strong engineering background but would like to move beyond messing about in notebooks. Much like myself a year or two ago.