Talks

GPU Communications for Python

Friday, May 15th, 2026 11 a.m.–11:30 a.m. in Room 104AB

Presented by

Benjamin Glick

Description

The AI/ML and HPC ecosystems share characteristics that make them difficult to serve:

  • They both require specialized orchestration of many processors across systems,
  • They both make use of hardware acceleration, and
  • They both have user communities that overwhelmingly prefer Python to other languages.

This session covers NVSHMEM4Py and NCCL4Py, libraries that make GPU-centric communication accessible to Python programmers without compromising performance or usability. NVSHMEM and NCCL are GPU-centric communication libraries primarily exposed through CUDA C++. These libraries address three common use cases:

  • Performing collectives on GPU memory, using GPU cores for mathematical operations,
  • Low-latency, high-bandwidth GPU-to-GPU communication, and
  • Enabling custom communication patterns and fusing communication with computation.

NVSHMEM4Py and NCCL4Py offer this functionality in Python, so AI/HPC practitioners can achieve high performance in multi-GPU programs without leaving Python. Our host APIs integrate with the CUDA Python ecosystem through array-oriented memory APIs. Point-to-point operations (put/get) and collectives (reduce) execute on CUDA streams, allowing communication to overlap with computation.
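To make the host-API model concrete, the following is a minimal pseudocode sketch of a stream-ordered collective. The import path, allocator, and function signatures (nvshmem.core, array, reduce) are illustrative assumptions for exposition, not the libraries' confirmed API, and the pattern requires a multi-GPU system to actually run:

```python
# Illustrative sketch only: module path and signatures below are
# assumptions, not a confirmed API.
import cupy as cp
import nvshmem.core as nvshmem  # hypothetical import path

nvshmem.init()                  # join the job, one PE (process) per GPU
stream = cp.cuda.Stream()

# Array-oriented allocation of a symmetric GPU buffer visible to all PEs
x = nvshmem.array((1024,), dtype="float32")   # hypothetical allocator

with stream:
    x[:] = cp.random.rand(1024)               # local, on-stream computation

# The collective is enqueued on the same stream, so it is ordered after
# the fill above and can overlap with independent work on other streams.
nvshmem.reduce(x, op="sum", stream=stream)    # hypothetical signature
stream.synchronize()
nvshmem.finalize()
```

The key design point from the session is stream ordering: because communication is just another operation on a CUDA stream, overlap with computation falls out of ordinary stream scheduling rather than requiring explicit host-side synchronization.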

Device APIs expose GPU-initiated communication primitives to Python DSLs such as Numba and CuTe. Developers invoke GPU-resident operations (e.g., device_put) from the DSL, enabling custom and fused kernels written entirely in Python. For example, a user kernel can perform partial computation and push results to peer GPUs without host intervention. To generate the code and link Python to CUDA C++, we use Numbast to generate 1:1 Python bindings of device functions, then write DSL code wrapping those bindings for a Pythonic experience. At runtime, we JIT-compile the Python functions to LTO-IR and link them against CUDA objects.
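A pseudocode sketch of the fused-kernel pattern described above, written as a Numba CUDA kernel: device_put is named in this abstract, but the binding module (`nvshmem.device` here) and the exact device-side signature are hypothetical placeholders, not confirmed bindings:

```python
# Illustrative sketch: only the pattern (partial computation fused with a
# GPU-initiated put, no host intervention) is from the talk; the module
# `nvshmem.device` and the device_put signature are assumptions.
from numba import cuda
import nvshmem.device as shm  # hypothetical Numbast-generated bindings

@cuda.jit
def fused_scale_and_push(src, dst, peer, n):
    i = cuda.grid(1)
    if i < n:
        src[i] = src[i] * 2.0              # partial computation on this GPU
    cuda.syncthreads()
    if i == 0:
        # GPU-initiated communication: push results to the peer PE
        # directly from the kernel, without returning to the host.
        shm.device_put(dst, src, n, peer)  # hypothetical signature
```

Because the communication call is itself a device function, the JIT pipeline described above (Python to LTO-IR, linked against CUDA objects) can inline it into the user's kernel, which is what makes fusing communication with computation possible from Python.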

The presentation covers the details of our host and device APIs, some applications using them, and performance evaluations comparing the NVSHMEM/NCCL APIs across CUDA and Python.
