Tools like PyArrow, Pandas, and Polars make it easy to work with dataframes. However, as datasets scale to terabytes, managing tables, evolving schemas, and ensuring consistency across tools become increasingly complex. Apache Iceberg™, an open table format, addresses these challenges and, through PyIceberg, integrates with your favorite Python-based tools.
We will start with an introduction to Iceberg and PyIceberg, focusing on the features PyIceberg brings to the Python ecosystem, such as schema evolution and transactional guarantees. We will demonstrate how PyIceberg supports interoperability between Iceberg tables and Python-native dataframe libraries such as PyArrow and Pandas, using practical examples of creating, querying, and writing to Iceberg tables.
Building on these examples, we will explore how Iceberg tables evolve during these operations. This includes an in-depth look at Iceberg's file structure (metadata files, manifest lists, and manifests) and how PyIceberg uses this structure to perform transactional table updates and optimize query planning, ensuring reliable performance at scale.
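To make the layering concrete, here is a minimal toy model of the idea, written with only the Python standard library. It is not PyIceberg's API or Iceberg's on-disk format: the names `ToyTable`, `DataFile`, `Manifest`, `Snapshot`, and `CommitConflict`, and the `id` column bounds, are all hypothetical. The sketch assumes a commit is a single swap of the "current snapshot" pointer that only succeeds if nobody else committed in the meantime, and that planning can skip files whose recorded min/max bounds exclude the filter value.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_id: int  # lower bound of a hypothetical "id" column in this file
    max_id: int  # upper bound of the same column


@dataclass(frozen=True)
class Manifest:
    files: tuple  # the data files tracked by this manifest


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # every manifest that makes up this table version


class CommitConflict(Exception):
    """Raised when the table changed since the writer read it."""


class ToyTable:
    def __init__(self) -> None:
        self._snapshots = {}  # snapshot_id -> Snapshot (immutable history)
        self.current_snapshot_id = None  # the single mutable pointer

    def append(self, files, base_snapshot_id):
        # Optimistic concurrency: refuse to commit on top of a stale base.
        if base_snapshot_id != self.current_snapshot_id:
            raise CommitConflict("table changed since it was read; retry")
        parent = self._snapshots.get(base_snapshot_id)
        inherited = parent.manifest_list if parent else ()
        new_id = len(self._snapshots) + 1
        # Existing manifests are reused as-is; only one new manifest is added.
        snapshot = Snapshot(new_id, inherited + (Manifest(tuple(files)),))
        self._snapshots[new_id] = snapshot
        self.current_snapshot_id = new_id  # the "atomic swap" in this toy
        return new_id

    def plan_files(self, id_value, snapshot_id=None):
        # Reading an older snapshot_id is a toy version of time travel.
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        if sid is None:
            return []
        snapshot = self._snapshots[sid]
        # Prune files whose min/max bounds cannot contain the value.
        return [
            f.path
            for manifest in snapshot.manifest_list
            for f in manifest.files
            if f.min_id <= id_value <= f.max_id
        ]


table = ToyTable()
s1 = table.append([DataFile("data/a.parquet", 1, 100)], base_snapshot_id=None)
s2 = table.append([DataFile("data/b.parquet", 101, 200)], base_snapshot_id=s1)

print(table.plan_files(150))                  # only b.parquet can match
print(table.plan_files(150, snapshot_id=s1))  # b.parquet did not exist yet
```

Because snapshots are immutable and a write only adds a new one, a reader that started on `s1` keeps seeing a consistent table while `s2` is being committed, and a writer that started from a stale base gets a `CommitConflict` instead of silently overwriting.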
Finally, we will discuss PyIceberg’s advanced features, including schema evolution, hidden partitioning, and time travel, which make table management efficient and flexible.
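Hidden partitioning is perhaps the least self-explanatory of these features, so here is a small standard-library sketch of the idea. It is an illustration only, not PyIceberg code: `ToyPartitionedTable`, `day_transform`, and the `ts` column are hypothetical names. The assumption is that the table derives the partition value from a source column at write time, and that a filter on the source column can be translated into a set of partitions to read, so neither writers nor readers ever handle a separate partition column.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value from the source column (a 'day' transform)."""
    return ts.date()


class ToyPartitionedTable:
    def __init__(self) -> None:
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, rows) -> None:
        # Writers pass plain rows; the table computes the partition itself.
        for row in rows:
            self.partitions[day_transform(row["ts"])].append(row)

    def scan(self, ts_from: datetime, ts_to: datetime):
        """Return rows with ts_from <= ts < ts_to and the partitions read."""
        first, last = day_transform(ts_from), day_transform(ts_to)
        # Conservative pruning: keep every day the time range could touch.
        touched = sorted(d for d in self.partitions if first <= d <= last)
        rows = [
            r
            for d in touched
            for r in self.partitions[d]
            if ts_from <= r["ts"] < ts_to
        ]
        return rows, touched


events = ToyPartitionedTable()
events.append(
    [
        {"id": 1, "ts": datetime(2024, 1, 1, 8, 0)},
        {"id": 2, "ts": datetime(2024, 1, 1, 20, 0)},
        {"id": 3, "ts": datetime(2024, 1, 2, 9, 0)},
        {"id": 4, "ts": datetime(2024, 1, 3, 10, 0)},
    ]
)

# The query mentions only "ts"; the January 1 partition is never read.
rows, touched = events.scan(datetime(2024, 1, 2), datetime(2024, 1, 3))
print([r["id"] for r in rows], touched)
```

The design point is that the partitioning scheme lives in table metadata rather than in user queries, which is what allows it to change later without rewriting those queries.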