Managing unstructured data such as videos, images, and text for deep learning is becoming increasingly complex. Yet current solutions for storing the data we create (databases, data lakes, and data warehouses) are unsuitable for computer vision data: there are numerous file formats, compression techniques, and storage locations to attend to. As dataset sizes grow, management becomes cumbersome and time-consuming, often leaving the data siloed. A database designed specifically for AI solves this problem.
Participants will learn about the Database for AI, a data-centric framework built on a format designed specifically for streaming data to deep learning workloads. Effectively, computer vision data is stored in the cloud and streamed in a format directly usable for training deep learning models. This enables (1) creating, storing, and collaborating on AI datasets of any size; (2) rapidly transforming and streaming data while training models at scale; and (3) instant exploration and visualization of AI datasets regardless of their size and storage location.
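The streaming idea in (2) can be sketched in plain Python. The snippet below is an illustrative sketch only, not the framework's actual API: `fetch_chunk` and `stream_batches` are hypothetical names standing in for a network read of one compressed chunk and a lazy batch iterator, showing how training can begin while most of the dataset is still remote.

```python
def fetch_chunk(chunk_id, chunk_size=4):
    """Stand-in for a network read of one compressed chunk of samples.
    (Hypothetical helper; a real implementation would read from cloud storage.)"""
    start = chunk_id * chunk_size
    return [{"image": f"img_{i}", "label": i % 10}
            for i in range(start, start + chunk_size)]

def stream_batches(num_chunks, batch_size=2):
    """Lazily yield training batches; only one chunk needs to be
    resident in memory at a time, so no upfront download is required."""
    buffer = []
    for chunk_id in range(num_chunks):
        buffer.extend(fetch_chunk(chunk_id))
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:  # flush any partial final batch
        yield buffer

# The training loop consumes batches as they stream in:
batches = list(stream_batches(num_chunks=3))  # 3 chunks x 4 samples -> 6 batches
```

In practice the chunk fetches would overlap with GPU compute via prefetching, which is what keeps the accelerator fed without a local copy of the dataset.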
To illustrate the benefits of the framework, we demonstrate one such use case: training ResNet-50 on ImageNet for several epochs. The novel approach outperforms the data-loading mechanisms provided by AWS SageMaker (such as File mode and Fast File mode); streaming data with our approach was found to be virtually as fast as reading from local disk. This cloud-native approach is 2x faster than the traditional method of storing data in a file system, and results in near-full utilization of V100 GPUs, at an average rate of 95%.
Effectively, we demonstrate that by adopting the AI-native database, data scientists can begin training without first downloading the data, and can hand data off to compute more efficiently.