NVIDIA has released cuda-checkpoint, a new command-line utility for checkpointing and restoring CUDA applications on Linux. The utility can be used together with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing tool, to save and later restore the state of CUDA applications.
Checkpointing Overview
Transparent, per-process checkpointing sits between virtual machine checkpointing and application-driven checkpointing. It is particularly useful for fault tolerance, task preemption, and cluster scheduling with migration. By combining cuda-checkpoint with CRIU, users can checkpoint the state of complex applications, including their CUDA state.
CRIU
CRIU, an open-source utility maintained outside of NVIDIA, checkpoints and restores Linux process trees. It handles kernel-mode resources such as anonymous memory, threads, regular files, sockets, and pipes. However, it has no native support for NVIDIA GPUs. This is where cuda-checkpoint comes in: it extends a CRIU-based workflow to cover CUDA state.
cuda-checkpoint
The cuda-checkpoint utility supports display driver version 550 and higher. It toggles the CUDA state of a process between running and suspended. The transition from running to suspended is called a suspend, and the reverse is called a resume. During a suspend, CUDA driver APIs are locked, already-submitted CUDA work is completed, device memory is copied to the host, and all CUDA GPU resources are released. During a resume, GPUs are re-acquired, device memory and GPU memory mappings are restored, CUDA objects are restored, and CUDA driver APIs are unlocked.
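The suspend and resume sequences described above can be pictured as a two-state toggle. The following Python sketch is only a toy model for illustration: the names CudaState, SUSPEND_STEPS, RESUME_STEPS, and toggle are hypothetical, and the step strings are labels paraphrased from the description, not calls into the CUDA driver.

```python
from enum import Enum


class CudaState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


# Labels paraphrasing the described steps; these are not driver calls.
SUSPEND_STEPS = [
    "lock CUDA driver APIs",
    "complete submitted CUDA work",
    "copy device memory to host",
    "release CUDA GPU resources",
]
RESUME_STEPS = [
    "re-acquire GPUs",
    "restore device memory and GPU memory mappings",
    "restore CUDA objects",
    "unlock CUDA driver APIs",
]


def toggle(state):
    """Return (new_state, steps) for one toggle of a process's CUDA state."""
    if state is CudaState.RUNNING:
        return CudaState.SUSPENDED, SUSPEND_STEPS
    return CudaState.RUNNING, RESUME_STEPS
```

Note that in this model the resume steps roughly mirror the suspend steps in reverse: the driver APIs are the first thing locked and the last thing unlocked.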
Checkpointing Example
An example application, counter, demonstrates the checkpointing process. The application increments a value in GPU memory each time it receives a packet and replies with the updated value. Users can build the application with nvcc and then checkpoint and restore it using cuda-checkpoint and CRIU commands.
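As a rough sketch of how the two tools fit together, the Python helpers below only build the command lines and do not run anything. The flags shown (--toggle and --pid for cuda-checkpoint; --shell-job, --images-dir, --tree, and --restore-detached for criu) are assumed to match NVIDIA's published example and CRIU's documented options, so they should be checked against the installed versions. The function names and the default image directory "demo" are hypothetical.

```python
import shlex


def checkpoint_commands(pid, images_dir="demo"):
    """Suspend CUDA state first, then let CRIU dump the process to disk."""
    return [
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir="demo"):
    """Let CRIU restore the process, then resume its CUDA state.

    Assumes the restored process keeps its original PID.
    """
    return [
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def render(commands):
    """Format argv lists as shell-safe command lines, one per line."""
    return "\n".join(shlex.join(argv) for argv in commands)
```

Calling print(render(checkpoint_commands(1234))) shows the two commands in order. The ordering is the point: the CUDA state is suspended before CRIU dumps the process, and toggled back only after CRIU has restored it.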
Functionality and Limitations
As of display driver version 550, the cuda-checkpoint utility is still under active development. Currently, it supports only the x64 architecture and acts on a single process rather than a process tree. It does not support UVM or IPC memory, and it does not support GPU migration. It also waits for already-submitted CUDA work to finish before completing a checkpoint. Future driver releases are expected to address these limitations without requiring updates to the utility itself.
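A script that automates checkpointing might check the two machine-level requirements up front. The sketch below is a hypothetical helper, not part of cuda-checkpoint: it assumes the driver version is a dotted string whose first component is the major version, and that an x64 Linux machine reports itself as "x86_64".

```python
MIN_DRIVER_MAJOR = 550  # minimum display driver version stated in the article


def driver_major(version):
    """Extract the major version from a dotted driver version string."""
    return int(version.strip().split(".")[0])


def precheck(driver_version, machine):
    """Return a list of reasons the environment looks unsupported."""
    problems = []
    if driver_major(driver_version) < MIN_DRIVER_MAJOR:
        problems.append("display driver older than 550")
    if machine != "x86_64":  # assumed spelling of x64 on Linux
        problems.append("architecture is not x64")
    return problems
```

The machine string could come from platform.machine() in the standard library. An empty list means only that these two checks passed. The other limitations, such as UVM or IPC memory use, depend on the application and are not detected here.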
Summary
The cuda-checkpoint utility, in combination with CRIU, enables transparent, per-process checkpointing of Linux applications that use CUDA. For further information, visit the official NVIDIA Technical Blog.