NVIDIA has unveiled new capabilities for its DOCA GPUNetIO library, adding GPU-accelerated Remote Direct Memory Access (RDMA) alongside its existing support for real-time inline GPU packet processing. The enhancement builds on technologies such as GPUDirect RDMA and GPUDirect Async, allowing a CUDA kernel to communicate directly with the network interface card (NIC) and bypass the CPU. The update aims to improve GPU-centric applications by reducing latency and CPU utilization, according to the NVIDIA Technical Blog.
Enhanced RDMA Functionality
Previously, DOCA GPUNetIO, along with DOCA Ethernet and DOCA Flow, was used for packet transmissions over the Ethernet transport layer. The latest update, DOCA 2.7, introduces a new set of APIs that enable RDMA communications directly from a GPU CUDA kernel using RoCE or InfiniBand transport layers. This development allows for high-throughput, low-latency data transfers by enabling the GPU to control the data path of the RDMA application.
RDMA GPU Data Path
RDMA allows direct access between the main memory of two hosts without involving the operating system, cache, or storage. This is achieved by registering and sharing a local memory area with the remote host, enabling high-throughput and low-latency data transfers. The process involves three fundamental steps: local configuration, exchange of information, and data path execution.
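To make these three steps concrete, below is a minimal host-side sketch in CUDA C++. The rdma_conn struct and the local_setup/exchange_conn_info helpers are hypothetical placeholders rather than DOCA 2.7 symbols; a real application would perform these steps with the DOCA RDMA and GPUNetIO host APIs, and error handling and the out-of-band exchange mechanism are omitted here.

```cuda
// Hypothetical skeleton of the three RDMA steps on the host side.
// Helper names and types are placeholders for this sketch, not DOCA 2.7
// symbols; a real application would call the DOCA RDMA / GPUNetIO host
// APIs (memory registration, connection export/connect, GPU queue handle).
#include <cuda_runtime.h>
#include <cstddef>

struct rdma_conn { void *gpu_buf; size_t len; };   // placeholder handle

// Step 1: local configuration - allocate GPU memory and register it with
// the NIC so it can be the source/target of RDMA operations.
static rdma_conn local_setup(size_t len) {
    rdma_conn c{};
    cudaMalloc(&c.gpu_buf, len);   // buffer exposed to the NIC via GPUDirect RDMA
    c.len = len;
    // ...register c.gpu_buf with the RDMA device and create the RDMA queue...
    return c;
}

// Step 2: exchange of information - trade memory keys, addresses and
// queue details with the remote peer over an out-of-band channel.
static void exchange_conn_info(rdma_conn &c, const char *peer) {
    (void)c; (void)peer;
    // ...send the local descriptor to the peer, receive the remote one...
}

// Step 3: data path - handed off to a CUDA kernel
// (a device-side sketch follows below).
__global__ void rdma_datapath_kernel(rdma_conn c) { /* GPU drives the RDMA queue */ }

int main() {
    rdma_conn conn = local_setup(4096);          // step 1
    exchange_conn_info(conn, "192.0.2.1");       // step 2 (peer address is illustrative)
    rdma_datapath_kernel<<<1, 32>>>(conn);       // step 3: GPU owns the data path
    cudaDeviceSynchronize();
    return 0;
}
```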
With the new GPUNetIO RDMA functions, the data path step can be executed by a CUDA kernel on the GPU instead of by the CPU. This reduces latency, frees up CPU cycles, and makes the GPU the main controller of the application.
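As an illustration of what that data path step might look like on the device, the sketch below has each CUDA thread post one RDMA write and a single thread commit the batch to the NIC. The rdma_queue handle and the rdma_post_write/rdma_commit device functions are hypothetical stand-ins for the DOCA GPUNetIO device API, which follows a similar post-then-commit pattern.

```cuda
// Hypothetical device-side data path: each thread enqueues one RDMA write
// and one thread flushes the batch to the NIC. The rdma_queue struct and
// the post/commit helpers are illustrative stand-ins for the DOCA GPUNetIO
// device functions, shown here as empty stubs so the sketch compiles.
#include <cstddef>
#include <cstdint>

struct rdma_queue;   // opaque GPU handle to an RDMA queue (placeholder)

__device__ void rdma_post_write(rdma_queue *q, int slot,
                                const void *src, size_t off, size_t len) {}
__device__ void rdma_commit(rdma_queue *q, int num_ops) {}

__global__ void rdma_write_kernel(rdma_queue *q, const uint8_t *src,
                                  size_t msg_size, int writes_per_iter)
{
    // Each thread prepares one or more write descriptors in the queue.
    for (int i = threadIdx.x; i < writes_per_iter; i += blockDim.x)
        rdma_post_write(q, i, src, i * msg_size, msg_size);

    __syncthreads();                       // all descriptors posted

    if (threadIdx.x == 0)
        rdma_commit(q, writes_per_iter);   // ring the doorbell: NIC executes the writes
}
```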
Performance Comparison
NVIDIA has conducted performance comparisons between GPUNetIO RDMA functions and IB Verbs RDMA functions using the perftest microbenchmark suite. The tests were executed on a Dell R750 machine with an NVIDIA H100 GPU and a ConnectX-7 network card. The results show that DOCA GPUNetIO RDMA performance is comparable to IB Verbs perftest, with both methods achieving similar peak bandwidth and elapsed times.
For the performance tests, parameters were set to 1 RDMA queue, 2,048 iterations, and 512 RDMA writes per iteration, with message sizes ranging from 64 to 4,096 bytes. Both implementations reached a peak bandwidth of up to 16 GB/s when the number of queues was increased to four, demonstrating the scalability and efficiency of the new GPUNetIO RDMA functions.
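For context on what those parameters imply, the short host-side snippet below simply multiplies them out for the largest message size: each commit moves 512 × 4,096 bytes (2 MiB) and a full run moves 4 GiB. These figures are plain arithmetic on the published parameters, not additional measurements.

```cuda
// Quick arithmetic on the benchmark parameters quoted above (host-side code).
#include <cstdio>

int main() {
    const long long iterations      = 2048;   // per test run
    const long long writes_per_iter = 512;    // RDMA writes posted per iteration
    const long long msg_size        = 4096;   // largest message size tested, in bytes

    const long long bytes_per_iter = writes_per_iter * msg_size;   // 2 MiB per commit
    const long long total_bytes    = iterations * bytes_per_iter;  // 4 GiB per run

    printf("bytes per iteration: %lld\n", bytes_per_iter);   // 2,097,152
    printf("total bytes per run: %lld\n", total_bytes);      // 4,294,967,296
    return 0;
}
```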
Benefits and Applications
The architectural choice of offloading RDMA data path control to the GPU offers several benefits:
- Scalability: Multiple RDMA queues can be managed in parallel (see the sketch after this list).
- Parallelism: Many CUDA threads can post RDMA operations simultaneously.
- Lower CPU Utilization: With minimal CPU involvement, performance is largely independent of the host platform.
- Reduced Bus Transactions: The CPU no longer has to synchronize data for the GPU, so fewer internal bus transactions are needed.
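A minimal sketch of how that scalability and parallelism can be expressed is shown below, mapping one RDMA queue per CUDA block with the threads of each block posting writes in parallel. The queue handle and device functions are the same hypothetical placeholders used earlier, standing in for the DOCA GPUNetIO device API; the four-queue launch in the trailing comment mirrors the benchmark configuration mentioned above.

```cuda
// Hypothetical multi-queue layout: one CUDA block per RDMA queue, threads
// within each block post writes in parallel. The queue handle and device
// calls are illustrative placeholders for the DOCA GPUNetIO device functions.
#include <cstddef>
#include <cstdint>

struct rdma_queue;   // opaque GPU handle to an RDMA queue (placeholder)

__device__ void rdma_post_write(rdma_queue *q, int slot,
                                const void *src, size_t off, size_t len) {}
__device__ void rdma_commit(rdma_queue *q, int num_ops) {}

__global__ void multi_queue_kernel(rdma_queue **queues, const uint8_t *src,
                                   size_t msg_size, int writes_per_queue)
{
    rdma_queue *q = queues[blockIdx.x];          // each block owns one queue

    for (int i = threadIdx.x; i < writes_per_queue; i += blockDim.x)
        rdma_post_write(q, i, src, i * msg_size, msg_size);

    __syncthreads();
    if (threadIdx.x == 0)
        rdma_commit(q, writes_per_queue);        // one doorbell per queue
}

// Launch with as many blocks as RDMA queues, e.g. for the four-queue case:
//   multi_queue_kernel<<<4, 256>>>(d_queues, d_src, 4096, 512);
```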
This update is particularly beneficial for network applications where data processing occurs on the GPU, enabling more efficient and scalable solutions. For more details, visit the NVIDIA Technical Blog.