Problem statement
Data scientists working in Python often spend the majority of their time cleaning input data, frequently from files. These files come in many formats, can be located anywhere, and sometimes have names like ‘data_final_final_v3.csv’. Data scientists then produce similar files of their own, adding to the pile. We call these collections “file zoos”.
Taming file zoos with DuckDB
DuckDB fits perfectly with Python
The MIT-licensed DuckDB database management system was designed to fit perfectly into data scientists’ workflows. Install its pre-compiled, dependency-free binary from pip. It reads and writes Pandas, Polars, and Apache Arrow dataframes directly for interoperability, and it ships with its own persistent, columnar file format.
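As a minimal sketch (the dataframe and column names here are hypothetical), querying a Pandas dataframe in place and getting a dataframe back takes only a few lines:

    # pip install duckdb pandas
    import duckdb
    import pandas as pd

    # A hypothetical dataframe standing in for freshly loaded data.
    df = pd.DataFrame({"species": ["duck", "goose", "duck"],
                       "weight_kg": [1.2, 3.4, 1.1]})

    # DuckDB queries the dataframe where it sits; no loading step is needed.
    result = duckdb.sql(
        "SELECT species, AVG(weight_kg) AS avg_kg FROM df GROUP BY species")

    # Results convert straight back to Pandas (or Polars / Arrow).
    print(result.df())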
Read and write files with confidence
DuckDB reads and writes CSV, Parquet, and JSON files, and even XLSX and Google Sheets. Its CSV reader is world-class, quickly querying even messy CSVs. DuckDB also interoperates with object stores across cloud providers and reads lakehouse formats such as Delta Lake and Iceberg.
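A sketch of this round trip, using hypothetical file and column names, might look like:

    import duckdb

    # The CSV sniffer detects the delimiter, header, and column types automatically.
    rel = duckdb.read_csv("data_final_final_v3.csv")

    # Clean up and persist the result as Parquet ("value" is a hypothetical column).
    clean = duckdb.sql("SELECT * FROM rel WHERE value IS NOT NULL")
    clean.write_parquet("cleaned.parquet")

    # With the httpfs extension loaded, s3:// and https:// URLs work the same way.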
Organize using the DuckDB format
Use DuckDB’s highly compressed columnar file format to persist many large tables in a single file. Store processing logic in views and functions, and update just portions of the file in place. When files should remain where they are, DuckDB can instead serve as a catalog over them.
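For example, a single project file can hold the raw tables and the derived views that describe how they are processed (the file layout and column names below are hypothetical):

    import duckdb

    # One .duckdb file replaces a folder full of intermediate CSVs.
    con = duckdb.connect("project.duckdb")

    # Persist raw data as a compressed, columnar table.
    con.sql("CREATE TABLE IF NOT EXISTS raw_events AS SELECT * FROM 'events/*.csv'")

    # Store processing logic alongside the data as a view.
    con.sql("""
        CREATE OR REPLACE VIEW daily_counts AS
        SELECT event_date, COUNT(*) AS n
        FROM raw_events
        GROUP BY event_date
    """)
    con.close()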
Beyond the format itself, DuckDB provides ACID transactional safety and parallel processing; its files can be read from 15+ languages and are guaranteed to remain readable for years to come. Larger-than-memory execution unlocks analyses that solve 2 TB problems, not 16 GB ones!
Extensions
Community extensions enable DuckDB to read additional formats; they are distributed through a pip-like package repository and installed with a single command.
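The sketch below loads the gsheets community extension, which provides the Google Sheets support mentioned above:

    import duckdb

    con = duckdb.connect()
    # Install from the community repository (much like pip), then load it.
    con.sql("INSTALL gsheets FROM community")
    con.sql("LOAD gsheets")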
Takeaways
Attendees will learn how to install and use DuckDB locally, how to integrate it seamlessly into their existing Python scripts or Jupyter notebooks, and how to smoothly manage the deluge of files in their workflow.