🤖 AI Summary
To address two critical challenges in deep learning training, namely low GPU utilization and the out-of-memory crashes or cross-task resource interference caused by task co-location, this paper proposes CARMA, a server-scale, task-level collocation-aware resource management framework. Its core contributions include: (1) GPUMemNet, a machine learning-based framework for fine-grained prediction of a training task's GPU memory consumption; (2) collocation policies that cap GPU utilization to limit interference; and (3) a lightweight task-recovery strategy that robustly restarts tasks that crash. By jointly managing memory allocation and utilization caps, CARMA strengthens system robustness while improving service quality and energy efficiency. Evaluation on traces modeled after real-world training task traces shows that CARMA increases GPU utilization over time by 39.3%, reduces end-to-end execution time by ~26.7%, and lowers energy consumption by ~14.2%.
📝 Abstract
Studies conducted on enterprise-scale infrastructure have shown that GPUs -- the core computational resource for deep learning (DL) training -- are often significantly underutilized. DL task collocation on GPUs is an opportunity to address this challenge. However, it may result in (1) out-of-memory crashes for the subsequently arriving task and (2) slowdowns for all tasks sharing the GPU due to resource interference. The former challenge poses a threat to robustness, while the latter affects the quality of service and energy efficiency.
We propose CARMA, a server-scale task-level collocation-aware resource management system that handles both collocation challenges. CARMA integrates GPUMemNet, a novel ML-based GPU memory estimation framework for DL training tasks, to minimize out-of-memory errors, and introduces collocation policies that cap GPU utilization to limit interference. Furthermore, CARMA provides a recovery method to ensure robust restart of tasks that crash. Our evaluation on traces modeled after real-world DL training task traces shows that CARMA increases GPU utilization over time by 39.3%, decreases end-to-end execution time by ~26.7%, and reduces GPU energy use by ~14.2%.