- PyCon US 2023

Talks: Supercharging Pipeline Efficiency with ML Performance Prediction

Sunday - April 23rd, 2023 1:45 p.m.-2:15 p.m. in 255DEF

Presented by:

Experience Level:

Some experience

Description

To process our customers' data, Singular's pipeline runs hundreds of thousands of daily tasks, each with a different processing time and resource requirements. We deal with this scale by using Celery and Kubernetes as our tasks infrastructure, letting us allocate dedicated workers and queues to each type of task based on its requirements. Originally, this was configured manually.

As our customer base grew, we noticed that heavier and longer tasks were grabbing all the resources and causing unacceptable queues in our pipeline. Moreover, some of the heavier tasks required significantly more memory, leading to OOM kills and infrastructure issues.

If we could classify tasks by their expected duration and memory requirements, we could have segregated tasks in Celery based on these properties and thus minimized interruptions to the rest of the pipeline. However, the variance in the size and granularity of the fetched data made it impossible to classify if a task was about to take one minute or one hour.

Our challenge was: how do we categorize these tasks, accurately and automatically? To solve the issue we implemented a machine-learning model that could predict the expected duration and memory usage of a given task. Using Celery’s advanced task routing capabilities, we could then dynamically configure different task queues based on the model's prediction.

This raised another challenge - how could we use the classified queues in the best way? Configuring workers statically for each queue would be inadequate at scale. We utilized Kubernetes’ vertical and horizontal autoscaling capabilities to dynamically allocate workers for each classified queue based on its length. This improved our ability to respond to pipeline load automatically, increasing performance and availability. Additionally, we were able to deploy shorter-lived workers on AWS Spot instances, giving us higher performance while lowering cloud costs.