Talks: Python and SQL: Better Together, Powered by DuckDB

Saturday - May 18th, 2024 3:15 p.m.-3:45 p.m. in Ballroom A

Presented by:

Description

Data management systems and data scientists have a troubled relationship: common systems such as Postgres or Spark are hard to set up and maintain, hard to transfer data into and out of, and hard to integrate into Python workflows. In response, data scientists have developed their own data wrangling tools, including Pandas and Polars. These tools are more natural to use, but they are limited in the amount of data they can process and in how much automatic optimization they perform.

DuckDB is a novel database management system purpose-built to combine cutting-edge data processing with dataframe-like ease of use. DuckDB integrates deeply with the Python data analytics and engineering ecosystems - its Python client has over 1.5 million downloads each month. DuckDB can read from and write to the data structures of popular Python libraries: Pandas, Polars, and Apache Arrow. DuckDB can even query Pandas dataframes faster than Pandas itself. Beyond dataframes, DuckDB can read from and write to Postgres, MySQL, and SQLite databases, as well as many file formats (even on cloud object storage).

In addition to the friendliest SQL dialect in the world, DuckDB provides options for using Pythonic dataframe syntax directly on top of the database engine. DuckDB includes a relational API and an experimental PySpark-compatible API, and it is the default engine for the Ibis portable dataframe library.

DuckDB supports complex queries, is MIT licensed, and has no external dependencies - it is a single pip install away! It is fast, easy to install and use, and handles larger-than-RAM datasets. Since DuckDB runs in the same process as the Python interpreter, no socket communication has to occur, making data transfer virtually instantaneous.

In our talk, we will describe DuckDB, compare it with Python dataframe libraries, and show how to combine DuckDB and dataframes for fast and easy data processing.