AI Summary
Large-scale deep learning on multi-tenant GPU clusters frequently suffers from job interruptions due to resource preemption, necessitating efficient and transparent checkpointing. Existing approaches rely on API interception and replay, incurring substantial runtime overhead and lacking cross-platform compatibility and native container integration. This paper introduces CRIUgpu, the first transparent checkpointing system that leverages newly exposed GPU vendor interfaces for memory and context export (supporting both CUDA and ROCm) and is deeply integrated with the Linux CRIU framework. Unlike prior work, CRIUgpu avoids API interception entirely, enabling kernel-level GPU device state capture and seamless containerized deployment. It achieves zero steady-state performance overhead. Evaluated on multi-GPU training and HPC workloads, CRIUgpu demonstrates robust correctness and significantly reduces recovery time compared to existing transparent checkpointing solutions.
Abstract
Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly challenging because of the hardware architecture differences between CPU and GPU, including memory subsystems, dynamic parallelism, and thread synchronization. State-of-the-art GPU checkpointing techniques typically leverage mechanisms that intercept, log, and replay device API calls. This approach, however, adds performance overhead and requires hardware-specific implementations that are difficult to test, maintain, and integrate with existing container platforms. In this paper, we present CRIUgpu, a novel approach for transparent checkpointing of GPU-accelerated workloads that builds on recently introduced driver capabilities, enabling support for CUDA and ROCm applications. Our evaluation results show that CRIUgpu works with a variety of deep learning and high-performance computing workloads running across multiple GPUs, completely eliminating steady-state performance overheads and significantly reducing recovery times compared to state-of-the-art transparent GPU checkpointing mechanisms.
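To make the intercept, log, and replay pattern mentioned in the abstract concrete, the following is a toy sketch in Python. It is not CRIUgpu and not any real GPU API: `FakeDevice`, `InterceptingProxy`, and `restore_by_replay` are hypothetical names introduced only for illustration. The sketch shows why interception does extra work on every call during normal execution, which is the steady-state overhead a driver-level approach aims to avoid.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU runtime; it only holds named buffers."""

    def __init__(self):
        self.buffers = {}

    def alloc(self, name, size):
        # Create a zero-filled buffer of the requested length.
        self.buffers[name] = [0] * size

    def fill(self, name, value):
        # Overwrite every element of an existing buffer.
        self.buffers[name] = [value] * len(self.buffers[name])


class InterceptingProxy:
    """Forwards method calls to the device and records each one in a log."""

    def __init__(self, device):
        self._device = device
        self.log = []

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself,
        # i.e. the device methods we want to intercept.
        target = getattr(self._device, name)

        def wrapper(*args):
            # This bookkeeping runs on every call, even if no
            # checkpoint is ever taken.
            self.log.append((name, args))
            return target(*args)

        return wrapper


def restore_by_replay(log):
    """Rebuild device state on a fresh device by re-issuing the logged calls."""
    device = FakeDevice()
    for name, args in log:
        getattr(device, name)(*args)
    return device


proxy = InterceptingProxy(FakeDevice())
proxy.alloc("weights", 4)
proxy.fill("weights", 7)

restored = restore_by_replay(proxy.log)
```

In this sketch the log grows with the number of calls and restore time grows with the log length. An approach that asks the driver to export device memory and context directly, as the abstract describes, needs neither the per-call logging nor the replay step.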