Talks: Data Processing on Ray

Presented by:


Description

Brief Description (To be used on conference website and advertising)

Machine learning and data processing workloads continue to drive the need for scalable Python applications. Ray is a distributed execution engine that enables programmers to scale up their Python applications.

This talk will cover some of the challenges we faced and the key architectural changes we made to Ray over the past year to support a new set of large-scale data processing workloads.

Detailed Abstract

Python has been the number one choice for developers building machine learning applications, thanks to its mature ecosystem and simplicity. However, due to the growing compute requirements of machine learning and data processing workloads, scaling Python applications has become a necessity.

Ray has been one of the favorite choices for developers to scale their Python applications due to its simple yet powerful APIs. Its growing popularity has established a robust ecosystem for scalable machine learning, including integrations with popular machine learning libraries like Horovod, spaCy, Huggingface, XGBoost, and scikit-learn, as well as native libraries like RLlib, Ray Tune, and Ray Serve.
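To give a flavor of that API style (a minimal sketch for illustration, not code from the talk itself), an ordinary Python function can be turned into a distributed task with the @ray.remote decorator:

    import ray

    ray.init()  # start or connect to a Ray cluster

    @ray.remote
    def square(x):
        # an ordinary Python function, executed remotely as a Ray task
        return x * x

    # launch tasks in parallel and gather the results
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]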

However, machine learning is not the only workload that needs to scale. Data processing is one of the most important parts of machine learning pipelines. The status quo for many teams today is to use workflow orchestrators such as Airflow or Kubeflow to stitch distributed systems together, which adds cost and management complexity.

Supporting large-scale data processing on Ray is important because it allows teams to build on a single system with multiple distributed “libraries”, reducing the operational burden and complexity of machine learning pipelines. We will introduce the recent improvements that make Ray a strong system for large-scale data processing workloads, along with its growing ecosystem. The talk will end with code examples that connect machine learning and data processing workloads into a single Python application.
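As a rough sketch of what such an end-to-end example might look like (the dataset path and training routine below are illustrative placeholders, and the Ray Datasets calls are assumptions rather than the talk's final code), loading, preprocessing, and training can live in one Python program:

    import ray

    ray.init()

    # Load a dataset in parallel; the path is a placeholder.
    ds = ray.data.read_parquet("s3://bucket/events/")

    # Distributed preprocessing expressed as plain Python over batches.
    def normalize(batch):
        batch["value"] = batch["value"] / batch["value"].max()
        return batch

    ds = ds.map_batches(normalize, batch_format="pandas")

    @ray.remote
    def train(shard):
        # Placeholder for any training step (XGBoost, scikit-learn, ...).
        return shard.count()

    # Hand the processed data to training tasks in the same application.
    shards = ds.split(n=2)
    results = ray.get([train.remote(s) for s in shards])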