The AI/ML and HPC ecosystems share characteristics that make them difficult to serve:
- They both require specialized orchestration of many processors across systems,
- They both rely on hardware acceleration, and
- They both have user communities who overwhelmingly prefer Python to other languages.
This session covers NVSHMEM4Py and NCCL4Py, libraries that make GPU-centric communication accessible to Python programmers without compromising performance or usability. NVSHMEM and NCCL are GPU-centric communication libraries primarily exposed through CUDA C++. These libraries address three common use cases:
- Performing collectives from GPU memory, using GPU cores for mathematical operations,
- Low-latency, high-bandwidth GPU-to-GPU communication, and
- Enabling custom communication patterns and fusing communication with computation
NVSHMEM4Py and NCCL4Py offer this in Python, so AI/HPC practitioners can achieve high performance in multi-GPU programs without leaving Python. Our host APIs integrate with the CUDA Python ecosystem through array-oriented memory APIs. Point-to-point operations (put/get) and collectives (e.g., reduce) execute on CUDA streams, enabling communication to overlap with computation.
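A host-side program following this model might look like the sketch below. This is illustrative pseudocode: the module path (nvshmem.core) and the names of the allocation, team, and collective functions are assumptions for exposition, not the exact shipped API.

```python
# Hypothetical sketch of a stream-ordered, host-initiated collective.
# Module and function names here are illustrative assumptions.
import nvshmem.core as nvshmem
from cuda.core.experimental import Device

dev = Device()
dev.set_current()
stream = dev.create_stream()

nvshmem.init(device=dev)          # join the NVSHMEM job
src = nvshmem.buffer(1 << 20)     # symmetric GPU-memory allocation
dst = nvshmem.buffer(1 << 20)

# The collective is enqueued on a CUDA stream, so it runs
# asynchronously and independent compute kernels launched on other
# streams can overlap with the communication.
nvshmem.reduce(nvshmem.Teams.TEAM_WORLD, dst, src, op="sum", stream=stream)
stream.sync()
nvshmem.finalize()
```

The key design point is that every operation takes an explicit stream, so ordering and overlap are expressed the same way as for ordinary CUDA kernels.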
Device APIs expose GPU-initiated communication primitives to Python DSLs such as Numba and CuTe. Developers invoke GPU-resident operations (e.g., device_put) in the DSL, enabling custom and fused kernels in Python. For example, a user kernel can perform partial computation and push results to peer GPUs without host intervention. To generate the code and link Python to CUDA C++, we use Numbast to generate 1:1 Python bindings of device functions. We then wrap those bindings in DSL code for a Pythonic experience. At runtime, we JIT-compile the Python functions to LTO-IR and link them against CUDA objects.
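A fused compute-plus-communication kernel in a DSL such as Numba-CUDA might look like the following sketch. Again this is pseudocode: the binding module path and the device_put signature are assumptions, standing in for the Numbast-generated bindings described above.

```python
# Hypothetical Numba-CUDA kernel that fuses partial computation with a
# GPU-initiated put. The device-binding names are illustrative assumptions.
from numba import cuda
import nvshmem.bindings.device as nvdev   # assumed module path

@cuda.jit
def scale_and_push(local_buf, remote_buf, peer, n):
    i = cuda.grid(1)
    if i < n:
        local_buf[i] *= 2.0               # partial computation on this GPU
        # Push the freshly computed element directly to the peer GPU,
        # with no host round-trip between compute and communication.
        nvdev.device_put(remote_buf, local_buf, i, peer)
```

Because the communication call is an ordinary device function in the kernel body, the JIT pipeline can compile it to LTO-IR and link it against the CUDA objects in one step, exactly as with any other device call.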
The presentation covers the details of our host and device APIs, some applications using them, and performance evaluations comparing the NVSHMEM/NCCL APIs across CUDA and Python.