PyCon 2019 in Cleveland, Ohio

Saturday 10:50 a.m.–11:20 a.m. in Room 26A/B/C

Machine learning model and dataset versioning practices

Dmitry Petrov

Description

Python is a prevalent programming language in machine learning (ML) community. A lot of Python engineers and data scientists feel the lack of engineering practices like versioning large datasets and ML models, and the lack of reproducibility. This lack is particularly acute for engineers who just moved to ML space. We will discuss the current practices of organizing ML projects using traditional open-source toolset like Git and Git-LFS as well as this toolset limitation. Thereby motivation for developing new ML specific version control systems will be explained. Data Version Control or [DVC.ORG][1] is an [open source][2], command-line tool written in Python. We will show how to version datasets with dozens of gigabytes of data and version ML models, how to use your favorite cloud storage (S3, GCS, or bare metal SSH server) as a data file backend and how to embrace the best engineering practices in your ML projects. [1]: http://dvc.org [2]: https://github.com/iterative/dvc