Problem statement
Data scientists working in Python often spend the majority of their time cleaning input data, frequently from files. These files come in many formats, can be located anywhere, and sometimes have names like ‘data_final_final_v3.csv’. Data scientists then produce similar files of their own, adding to the pile. We call these collections “file zoos”.
Taming file zoos with DuckDB
DuckDB fits perfectly with Python
The MIT-licensed DuckDB database management system was designed to fit perfectly into data scientists’ workflows. Install its pre-compiled, dependency-free binary from pip. It reads and writes Pandas, Polars, and Apache Arrow dataframes directly for interoperability, and it ships with its own persistent, columnar file format.
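As a minimal sketch (the dataframe and column names here are hypothetical), querying a Pandas dataframe in place and getting a dataframe back takes only a few lines:

    # pip install duckdb pandas
    import duckdb
    import pandas as pd

    # A hypothetical dataframe standing in for freshly loaded data.
    df = pd.DataFrame({"species": ["duck", "goose", "duck"],
                       "weight_kg": [1.2, 3.4, 1.1]})

    # DuckDB queries the dataframe where it sits; no loading step is needed.
    result = duckdb.sql(
        "SELECT species, AVG(weight_kg) AS avg_kg FROM df GROUP BY species")

    # Results convert straight back to Pandas (or Polars / Arrow).
    print(result.df())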
Read and write files with confidence
DuckDB reads and writes CSV, Parquet, and JSON files, and even XLSX and Google Sheets. Its CSV reader is world-class, quickly querying even messy CSVs. DuckDB also interoperates with object stores across cloud providers and reads lakehouse formats such as Delta Lake and Iceberg.
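A sketch of this round trip, using hypothetical file and column names, might look like:

    import duckdb

    # The CSV sniffer detects the delimiter, header, and column types automatically.
    rel = duckdb.read_csv("data_final_final_v3.csv")

    # Clean up and persist the result as Parquet ("value" is a hypothetical column).
    clean = duckdb.sql("SELECT * FROM rel WHERE value IS NOT NULL")
    clean.write_parquet("cleaned.parquet")

    # With the httpfs extension loaded, s3:// and https:// URLs work the same way.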
Organize using the DuckDB format
Use DuckDB’s highly compressed columnar file format to persist many large tables in a single file. Store processing logic in views and functions, and update just portions of the file in place. When files should remain where they are, DuckDB can instead serve as a catalog over them.
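For example, a single project file can hold the raw tables and the derived views that describe how they are processed (the file layout and column names below are hypothetical):

    import duckdb

    # One .duckdb file replaces a folder full of intermediate CSVs.
    con = duckdb.connect("project.duckdb")

    # Persist raw data as a compressed, columnar table.
    con.sql("CREATE TABLE IF NOT EXISTS raw_events AS SELECT * FROM 'events/*.csv'")

    # Store processing logic alongside the data as a view.
    con.sql("""
        CREATE OR REPLACE VIEW daily_counts AS
        SELECT event_date, COUNT(*) AS n
        FROM raw_events
        GROUP BY event_date
    """)
    con.close()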
Beyond the format itself, DuckDB provides ACID transactional safety and parallel processing; its files can be read from 15+ languages and are guaranteed to remain readable for years to come. Larger-than-memory execution unlocks analyses that solve 2 TB problems, not 16 GB ones!
Extensions
Community extensions enable DuckDB to read additional formats; they are distributed through a pip-like package repository and installed with a single command.
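The sketch below loads the gsheets community extension, which provides the Google Sheets support mentioned above:

    import duckdb

    con = duckdb.connect()
    # Install from the community repository (much like pip), then load it.
    con.sql("INSTALL gsheets FROM community")
    con.sql("LOAD gsheets")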
Takeaways
Attendees will learn how to install and use DuckDB locally, how to integrate it seamlessly into their existing Python scripts or Jupyter notebooks, and how to smoothly manage the deluge of files in their workflow.