Tabular data is everywhere. As Python has become the language of choice for data science, pandas and scikit-learn have become staples in the machine learning (ML) toolkit for processing and modeling this data. However, as data size scales up, these tools become unwieldy (slow) or altogether untenable (running out of memory). Ibis provides a unified, Pythonic dataframe interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Local backends, such as Polars, DuckDB, and DataFusion, can be orders of magnitude faster than pandas while using less memory. Ibis further enables users to scale out using distributed backends like Spark or cloud data warehouses like Snowflake and BigQuery without changing their code, giving them the power to choose the right engine for any scale.
IbisML extends these benefits of Ibis to the ML workflow. It lets users preprocess their data at scale on any Ibis-supported backend: users create IbisML recipes defining sequences of last-mile preprocessing steps that get their data ready for modeling. A recipe and any scikit-learn estimator can be chained together into a pipeline, so IbisML integrates seamlessly with scikit-learn, XGBoost (via its scikit-learn estimator interface), and PyTorch (via skorch) models. At inference time, Ibis/IbisML again pushes feature preprocessing down to the efficient backend, and user-defined functions (UDFs) enable prediction while minimizing data transfer. This completes an end-to-end ML workflow that scales with data size.
In this tutorial, you'll build an end-to-end ML project to predict the live win probability at any given move during a chess game, using actual, recent games from the largest free chess server (Lichess). We'll be using GitHub Codespaces, so you don't need to download or configure anything; however, please bring a laptop!