Talks: Building NumPy Arrays from CSV Files, Faster than Pandas
Friday - April 21st, 2023 12:15 p.m.-12:45 p.m. in 255DEF
Twenty years ago, in 2003, Python 2.3 was released with csv.reader(), a function that provided support for parsing CSV files. The C implementation, proposed in PEP 305, defines a core tokenizer that has been a reference for many subsequent projects. Two commonly needed features, however, were not addressed in csv.reader(): determining type per column, and converting strings to those types (or columns to arrays). Pandas read_csv() implements automatic type conversion and realization of columns as NumPy arrays (delivered in a DataFrame), with performance good enough to be widely regarded as a benchmark. The Pandas implementation, however, does not support all NumPy dtypes. While NumPy offers loadtxt() and genfromtxt() for similar purposes, the former (recently re-implemented in C) does not implement automatic type discovery, while the latter (implemented in Python) suffers poor performance at scale.
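To make the gap concrete, here is a minimal sketch using only the standard library and NumPy. The sample data and variable names are illustrative assumptions. It shows that csv.reader() yields only strings, so the caller has to choose a type for each column and convert it by hand.

```python
import csv
import io

import numpy as np

# Illustrative sample data (an assumption, not taken from the talk).
text = "id,price,flag\n1,2.5,True\n2,4.0,False\n"

# csv.reader() tokenizes each record but leaves every field as a str.
rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

# Transpose the records so that each tuple holds one column of strings.
columns = list(zip(*records))

# Type choice and conversion are left to the caller, one column at a time.
ids = np.array([int(v) for v in columns[0]], dtype=np.int64)
prices = np.array([float(v) for v in columns[1]], dtype=np.float64)
flags = np.array([v == "True" for v in columns[2]], dtype=bool)
```

Each conversion here builds one intermediate Python object per field before the array is filled.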
To support reading delimited files in StaticFrame (a DataFrame library built on an immutable data model), I needed something different: the full configuration options of Python's csv.reader(); optional type discovery for one or more columns; support for all NumPy dtypes; and performance competitive with Pandas.
Following the twenty-year tradition of extending csv.reader(), I implemented delimited_to_arrays() as a C extension to meet these needs. Using a family of C functions and structs, Unicode code points are collected per column (with optional type discovery), converted to C types, and written into NumPy arrays, all with minimal PyObject creation or reference counting. With the function incorporated in StaticFrame, performance tests across a range of DataFrame shapes and type heterogeneity show significant performance advantages over Pandas. Independent of usage in StaticFrame, delimited_to_arrays() provides a powerful new resource for converting CSV files to NumPy arrays. This presentation will review the background, architecture, and performance characteristics of this new implementation.
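The column-wise flow described above (collect fields per column, optionally discover a type, convert, write into an array) can be sketched in pure Python. This is only an illustration of the idea, not the C implementation and not its API. The names sketch_delimited_to_arrays and discover_dtype, the dtypes mapping, and the discovery order (bool, then integer, then float, then string) are all assumptions made for the example.

```python
import csv
import io

import numpy as np


def discover_dtype(values):
    """Return bool, int64, float64, or str: the first that parses every value."""
    if all(v in ("True", "False") for v in values):
        return np.dtype(bool)
    for caster, dtype in ((int, np.int64), (float, np.float64)):
        try:
            for v in values:
                caster(v)
        except ValueError:
            continue  # at least one value failed; try the next candidate
        return np.dtype(dtype)
    return np.dtype(str)


def sketch_delimited_to_arrays(lines, dtypes=None, **reader_kwargs):
    """Hypothetical helper: return one NumPy array per column.

    dtypes maps a column index to an explicit dtype; columns without an
    entry fall back to type discovery. Extra keyword arguments are passed
    to csv.reader(), so its configuration options stay available.
    """
    dtypes = dtypes or {}
    # Collect the string fields of each column.
    columns = [list(col) for col in zip(*csv.reader(lines, **reader_kwargs))]
    arrays = []
    for i, values in enumerate(columns):
        dtype = np.dtype(dtypes[i]) if i in dtypes else discover_dtype(values)
        if dtype.kind == "b":
            array = np.array([v == "True" for v in values], dtype=bool)
        elif dtype.kind == "U":
            array = np.array(values, dtype=str)
        else:
            # NumPy parses the string array into the numeric dtype.
            array = np.array(values, dtype=str).astype(dtype)
        arrays.append(array)
    return arrays


arrays = sketch_delimited_to_arrays(
    io.StringIO("1|2.5|a|True\n2|4|b|False\n"),
    dtypes={0: np.float32},  # explicit dtype for the first column only
    delimiter="|",
)
```

Unlike the C extension described in the abstract, this sketch creates a Python string object for every field, which is the overhead the talk's implementation is designed to avoid.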