NVIDIA NCCL 2.28 Revolutionizes GPU Communication with New Device API
Rebeca Moen Nov 10, 2025 23:56
NVIDIA's NCCL 2.28 release introduces a device API that lets CUDA kernels initiate communication themselves, so communication and computation can be fused rather than run as separate steps.
NVIDIA has released NCCL 2.28, the latest version of the NVIDIA Collective Communications Library (NCCL). The update focuses on fusing communication with computation, with the aim of raising throughput, reducing latency, and improving GPU utilization across multi-GPU and multi-node systems, according to NVIDIA.
Key Features of NCCL 2.28
NCCL 2.28 adds GPU-initiated networking, device APIs for communication-compute fusion, and copy-engine-based collectives. Together these are intended to help developers build efficient, scalable distributed applications. The release also includes expanded host APIs, improved tooling, and cleaner integration paths that make custom communication kernels easier to write.
Device API and Copy Engine Collectives
The new device API lets developers write custom device kernels that perform communication from inside NVIDIA CUDA kernels, instead of relying on the host to launch each communication operation. Keeping communication in the same kernel as the computation cuts synchronization overhead between the two, which raises throughput and lowers latency. The API defines three operation modes: Load/Store Accessible (LSA), for peers whose memory can be reached with ordinary load/store operations, such as GPUs connected over NVLink; Multimem, which uses the hardware multicast provided by NVLink SHARP; and GPU-Initiated Networking (GIN), for transfers over the network that are started by the GPU itself.
In addition, the copy-engine-based collectives make NVLink transfers more efficient by moving the data movement off the streaming multiprocessors (SMs) and onto the GPU's dedicated copy engines. Because the collective no longer competes with compute kernels for SMs, communication and computation can run at the same time with less resource contention.
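To make the contention argument concrete, the following is a deliberately simplified toy model, not a measurement and not NCCL code. It assumes compute time scales linearly with the number of SMs available and that communication fully overlaps with compute; the function names and all numbers are hypothetical.

```python
def step_time_sm_collective(compute_ms: float, comm_ms: float,
                            total_sms: int, comm_sms: int) -> float:
    """Toy model: the collective runs as a kernel that occupies `comm_sms` SMs.

    Assumption (hypothetical): compute slows down in proportion to the SMs
    it loses, and the two kernels otherwise overlap perfectly.
    """
    if not 0 <= comm_sms < total_sms:
        raise ValueError("comm_sms must be in [0, total_sms)")
    slowed_compute_ms = compute_ms * total_sms / (total_sms - comm_sms)
    return max(slowed_compute_ms, comm_ms)


def step_time_copy_engine(compute_ms: float, comm_ms: float) -> float:
    """Toy model: the collective runs on copy engines, so compute keeps every SM."""
    return max(compute_ms, comm_ms)


# Hypothetical numbers chosen only to illustrate the effect.
sm_based = step_time_sm_collective(compute_ms=8.0, comm_ms=6.0,
                                   total_sms=100, comm_sms=20)
ce_based = step_time_copy_engine(compute_ms=8.0, comm_ms=6.0)
print(f"SM-based collective: {sm_based:.1f} ms, copy-engine collective: {ce_based:.1f} ms")
```

In this model the benefit comes entirely from compute keeping all of its SMs; real gains depend on the workload, the interconnect, and how well the two streams actually overlap.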
NCCL Inspector for Enhanced Profiling
The NCCL Inspector is a new profiler plugin that provides always-on observability of NCCL communication patterns. It logs performance data and metadata for collective operations, so developers can analyze and debug them after the fact. The plugin tracks each NCCL communicator separately, which makes it possible to compare performance across different communication contexts within the same job.
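As an illustration of the kind of per-communicator analysis such logs make possible, here is a minimal sketch. The record layout below is hypothetical and is not the Inspector's actual output schema. The bandwidth definitions follow the convention used by NVIDIA's nccl-tests: algorithm bandwidth is data size divided by time, and bus bandwidth applies a per-collective correction factor.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CollectiveRecord:
    """Hypothetical log record; field names are assumptions for illustration."""
    comm_id: str         # identifies the communicator the collective ran on
    collective: str      # e.g. "AllReduce"
    n_ranks: int
    message_bytes: int   # data size, using the nccl-tests size convention
    elapsed_us: float


def busbw_factor(collective: str, n_ranks: int) -> float:
    """Correction factors as documented for nccl-tests."""
    if collective == "AllReduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("AllGather", "ReduceScatter"):
        return (n_ranks - 1) / n_ranks
    if collective in ("Broadcast", "Reduce"):
        return 1.0
    raise ValueError(f"no factor defined here for {collective!r}")


def algbw_gbps(rec: CollectiveRecord) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return rec.message_bytes / (rec.elapsed_us * 1e-6) / 1e9


def busbw_gbps(rec: CollectiveRecord) -> float:
    """Bus bandwidth in GB/s, comparable across rank counts."""
    return algbw_gbps(rec) * busbw_factor(rec.collective, rec.n_ranks)


def mean_busbw_by_communicator(records: Iterable[CollectiveRecord]) -> Dict[str, float]:
    """Average bus bandwidth for each communicator."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for rec in records:
        grouped[rec.comm_id].append(busbw_gbps(rec))
    return {comm: sum(vals) / len(vals) for comm, vals in grouped.items()}
```

Grouping by communicator mirrors the per-communicator tracking described above: a tensor-parallel communicator and a data-parallel communicator in the same job can show very different bandwidths, and averaging them together would hide that.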
Developer Experience Improvements
NCCL 2.28 also improves the developer experience. It adds new APIs for the AllToAll, Gather, and Scatter operations. A new environment plugin API makes configuration management more flexible, supporting programmatic version matching and storage-agnostic configuration. The release also supports CMake for Linux builds, which simplifies integration into larger build pipelines.
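For readers unfamiliar with these three collectives, the sketch below models their data movement in plain Python, with one list entry per rank. It shows only the conventional semantics of the operations and says nothing about the NCCL function signatures; all names here are illustrative.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def all_to_all(send: Sequence[Sequence[T]]) -> List[List[T]]:
    """send[src][dst] is the chunk rank `src` sends to rank `dst`.

    Returns recv, where recv[dst][src] is that same chunk as seen by `dst`.
    """
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("each rank must supply exactly one chunk per rank")
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def gather(send: Sequence[T], root: int) -> List[List[T]]:
    """Every rank contributes one chunk; only `root` receives them, in rank order."""
    n = len(send)
    if not 0 <= root < n:
        raise ValueError("root out of range")
    return [list(send) if rank == root else [] for rank in range(n)]


def scatter(send: Sequence[Sequence[T]], root: int) -> List[T]:
    """Only the root's buffer is read; rank r receives send[root][r]."""
    n = len(send)
    if not 0 <= root < n:
        raise ValueError("root out of range")
    if len(send[root]) != n:
        raise ValueError("root must supply exactly one chunk per rank")
    return [send[root][rank] for rank in range(n)]


# AllToAll behaves like a transpose of the per-rank chunk matrix.
print(all_to_all([[1, 2], [3, 4]]))  # prints [[1, 3], [2, 4]]
```

Scatter and Gather are inverses of each other with respect to the root, and AllToAll can be read as every rank performing a Scatter at once.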
For further details on NCCL 2.28 and its features, visit the official NVIDIA blog.