The AI/ML and HPC ecosystems share characteristics that make them difficult to serve:
- They both require specialized orchestration of many processors across systems,
- They both rely on hardware acceleration, and
- They both have user communities who overwhelmingly prefer Python to other languages.
This session covers NVSHMEM4Py and NCCL4Py, libraries that make GPU-centric communication accessible to Python programmers without compromising performance or usability. NVSHMEM and NCCL are GPU-centric communication libraries primarily exposed through CUDA C++. These libraries address three common use cases:
- Performing collectives from GPU memory, using GPU cores for mathematical operations,
- Low-latency, high-bandwidth GPU-to-GPU communication, and
- Enabling custom communication patterns and fusing communication with computation
NVSHMEM4Py and NCCL4Py offer this in Python, so AI/HPC practitioners can achieve high performance in multi-GPU programs without leaving Python. Our host APIs integrate with the CUDA Python ecosystem through array-oriented memory APIs. Point-to-point operations (put/get) and collectives (e.g., reduce) execute on CUDA streams, enabling communication to overlap with computation.
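A host-side program following this model might look like the sketch below. This is illustrative pseudocode: the module path (nvshmem.core) and the names of the allocation, team, and collective functions are assumptions for exposition, not the exact shipped API.

```python
# Hypothetical sketch of a stream-ordered, host-initiated collective.
# Module and function names here are illustrative assumptions.
import nvshmem.core as nvshmem
from cuda.core.experimental import Device

dev = Device()
dev.set_current()
stream = dev.create_stream()

nvshmem.init(device=dev)          # join the NVSHMEM job
src = nvshmem.buffer(1 << 20)     # symmetric GPU-memory allocation
dst = nvshmem.buffer(1 << 20)

# The collective is enqueued on a CUDA stream, so it runs
# asynchronously and independent compute kernels launched on other
# streams can overlap with the communication.
nvshmem.reduce(nvshmem.Teams.TEAM_WORLD, dst, src, op="sum", stream=stream)
stream.sync()
nvshmem.finalize()
```

The key design point is that every operation takes an explicit stream, so ordering and overlap are expressed the same way as for ordinary CUDA kernels.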
Device APIs expose GPU-initiated communication primitives to Python DSLs such as Numba and CuTe. Developers invoke GPU-resident operations (e.g., device_put) in the DSL, enabling custom and fused kernels in Python. For example, a user kernel can perform partial computation and push results to peer GPUs without host intervention. To generate the code and link Python to CUDA C++, we use Numbast to generate 1:1 Python bindings of device functions. We then wrap those bindings in DSL code for a Pythonic experience. At runtime, we JIT-compile the Python functions to LTO-IR and link them against CUDA objects.
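A fused compute-plus-communication kernel in a DSL such as Numba-CUDA might look like the following sketch. Again this is pseudocode: the binding module path and the device_put signature are assumptions, standing in for the Numbast-generated bindings described above.

```python
# Hypothetical Numba-CUDA kernel that fuses partial computation with a
# GPU-initiated put. The device-binding names are illustrative assumptions.
from numba import cuda
import nvshmem.bindings.device as nvdev   # assumed module path

@cuda.jit
def scale_and_push(local_buf, remote_buf, peer, n):
    i = cuda.grid(1)
    if i < n:
        local_buf[i] *= 2.0               # partial computation on this GPU
        # Push the freshly computed element directly to the peer GPU,
        # with no host round-trip between compute and communication.
        nvdev.device_put(remote_buf, local_buf, i, peer)
```

Because the communication call is an ordinary device function in the kernel body, the JIT pipeline can compile it to LTO-IR and link it against the CUDA objects in one step, exactly as with any other device call.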
The presentation covers the details of our host and device APIs, some applications using them, and performance evaluations comparing the NVSHMEM/NCCL APIs across CUDA and Python.