High-Performance LLM Inference in Pure Python with PyTorch Custom Ops

Saturday, May 16th, 2026 5 p.m.–5:30 p.m. in Grand Ballroom A

Presented by

Yineng Zhang

Description

This talk presents how a modern large language model (LLM) inference engine written entirely in Python can reach performance comparable to C++-based systems such as TensorRT-LLM. Drawing on experience maintaining SGLang, an open-source pure-Python inference engine, this session explains the design principles behind choosing a Python-first architecture and how it achieves high throughput and low latency in real deployments.

The core idea is that Python, supported by the evolving PyTorch runtime stack, can serve not only as orchestration code but also as the foundation of a performant inference system. Three components make this possible: a multi-process scheduling framework for parallelism, a zero-overhead batch scheduler that hides CPU overhead behind GPU execution, and a set of high-performance PyTorch custom ops implemented directly in Python. Together, they keep GPUs fully utilized while maintaining a clean and flexible Python development experience.

This talk will cover:

  • Why a pure-Python architecture can match C++ inference engines

  • The key design principles behind multi-process runtime scheduling

  • How zero-overhead batch scheduling improves utilization and reduces latency

  • The role of PyTorch custom ops in high-performance serving

  • Insights and practical lessons learned from scaling a Python LLM engine
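To make the zero-overhead scheduling idea concrete, the following is a toy sketch, not SGLang's scheduler: CPU-side batch preparation runs in a background thread so the next batch is ready the moment the "GPU" finishes the current one. The helper names and the `time.sleep` stand-ins for real work are assumptions for illustration.

```python
import queue
import threading
import time

def prepare_batch(i):
    # Simulated CPU-side work: tokenization, batching, metadata construction.
    time.sleep(0.01)
    return f"batch-{i}"

def gpu_execute(batch):
    # Simulated GPU kernel execution.
    time.sleep(0.02)
    return f"{batch}-done"

def run_overlapped(n):
    # A bounded queue lets the producer stay exactly one batch ahead,
    # so CPU preparation of batch i+1 overlaps GPU execution of batch i.
    q = queue.Queue(maxsize=1)

    def producer():
        for i in range(n):
            q.put(prepare_batch(i))
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(gpu_execute(batch))
    return results

print(run_overlapped(3))  # ['batch-0-done', 'batch-1-done', 'batch-2-done']
```

With overlap, total time approaches the sum of GPU times alone rather than CPU plus GPU time per batch, which is the sense in which the scheduling overhead becomes "zero" from the GPU's perspective.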

Attendees will gain a clear understanding of how Python can power high-performance LLM workloads and how these ideas can be applied to their own systems—without requiring C++ or CUDA expertise.
