PyCon 2022

Talks: Speed Up Data Access with PyArrow (Apache Arrow) - Data is the new API

Presented by:

Deepak K Gupta

Description

Until now, we have typically accessed data through APIs, which make sure we receive it in the desired format. Unfortunately, this means the data has to go through a serialization / deserialization cycle before the API returns it.

What if we could arrange the data so that it needs neither an API nor any serialization / deserialization to be accessed and understood, even from multiple programming languages?

If that sounds interesting, welcome to the world of Apache Arrow, which defines a language-independent columnar memory format that supports zero-copy reads for lightning-fast data access without serialization overhead.

Its Python library is called PyArrow. It integrates with Python-specific libraries such as pandas and NumPy, and can extend these benefits to them.

In this talk you’ll learn about the architecture, the use cases, and the reasons for using Apache Arrow through PyArrow. I’ll share how to use it, along with some interesting statistics on the difference it makes in our day-to-day data access and analytics.

I’ll also talk about Apache Arrow Flight, a high-performance wire protocol focused on bulk data transfer for analytics.

This session is NOT a tutorial on PyArrow, but a set of interesting improvements, facts and statistics that can help you decide whether it is worth exploring for the work you’re doing.