Data pipelines are vital for moving data from source to destination. They help with use cases like integrating multimodal data, building a data warehouse, improving data quality, and more.
Over the years, developers have established many design patterns and best practices for building data pipelines in Python with pandas, NumPy, and other libraries. One crucial criterion for any data pipeline, however, is idempotency: rerunning the pipeline on the same input leaves the destination in the same state as running it once.
This talk will open with a brief overview of data pipelines and the importance of idempotency in distributed systems. We’ll then work through an example to answer the question: what does it take to build an idempotent data pipeline in Python?
Our exploration will begin with the pitfalls of non-idempotent pipelines, then proceed to a methodology for building idempotent data pipelines and the design decisions that accompany them. Along the way, we’ll explore testing strategies using pytest.
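As a rough illustration of the contrast described above, here is a minimal sketch that is not taken from the talk itself. It assumes a hypothetical `orders` dataset keyed by `order_id` and uses an in-memory SQLite database as a stand-in destination. The table names and helper functions are invented for this example. An append-only load duplicates rows when a batch is retried, while a keyed load can be rerun safely, and a pytest-style test checks that property.

```python
import sqlite3


def connect():
    """Create an in-memory destination with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    # No key: nothing stops the same record from landing twice.
    conn.execute("CREATE TABLE orders_append (order_id INTEGER, amount REAL)")
    # A primary key on the business identifier makes the load repeatable.
    conn.execute(
        "CREATE TABLE orders_upsert (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    return conn


def load_append(conn, rows):
    """Non-idempotent load: every run appends the batch again."""
    conn.executemany("INSERT INTO orders_append VALUES (?, ?)", rows)
    conn.commit()


def load_upsert(conn, rows):
    """Idempotent load: a row with an existing order_id is replaced."""
    conn.executemany("INSERT OR REPLACE INTO orders_upsert VALUES (?, ?)", rows)
    conn.commit()


def snapshot(conn, table):
    """Return the table contents in a stable order for comparison."""
    query = f"SELECT order_id, amount FROM {table} ORDER BY order_id"
    return conn.execute(query).fetchall()


def test_upsert_load_is_idempotent():
    """pytest-style check: loading the same batch twice changes nothing."""
    conn = connect()
    batch = [(1, 9.5), (2, 20.0)]
    load_upsert(conn, batch)
    state_after_first_run = snapshot(conn, "orders_upsert")
    load_upsert(conn, batch)
    assert snapshot(conn, "orders_upsert") == state_after_first_run
```

Saved in a file such as `test_pipeline.py` (a hypothetical name), the test can be collected and run with `pytest`. Replacing rows by key is only one possible design choice. Other approaches, such as overwriting a whole partition per run, rest on the same idea of making a retry converge to the same destination state.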
This talk is aimed at those interested in building idempotent data pipelines.