xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning workloads in shared GPU clusters face significant challenges in accurately predicting GPU memory requirements: existing approaches either neglect runtime dynamism or rely on GPU instrumentation or intrusive code modifications, resulting in high out-of-memory (OOM) risk and low resource utilization. This paper introduces the first CPU-only, non-intrusive dynamic analysis framework for peak GPU memory prediction. It combines execution tracing, memory access pattern modeling, ANOVA-driven sensitivity analysis, and Monte Carlo simulation—requiring zero GPU overhead and no source-code modification. Evaluated across 25 models and 5,209 experiments, our method reduces median relative error by 91%, decreases OOM estimation failure rate by 75% under safety thresholds, and improves potential memory savings by 368%.
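To make the estimation pipeline concrete, here is a minimal sketch of the Monte Carlo step described above: perturb traced per-operation allocation sizes to model runtime dynamism, then take a high quantile of the simulated peaks as a safety-margin estimate. This is an illustration of the general technique, not xMem's actual implementation; the noise model, `jitter` parameter, and event format are assumptions.

```python
import random

def simulate_peak(alloc_events, jitter=0.05, trials=1000, quantile=0.99):
    """Monte Carlo sketch of peak-memory estimation.

    `alloc_events` is a list of (alloc_bytes, free_bytes) pairs taken from
    an execution trace. Each trial perturbs allocation sizes by a uniform
    jitter (a hypothetical noise model) and records the simulated peak;
    a high quantile of the peaks serves as a safe estimate.
    """
    peaks = []
    for _ in range(trials):
        current = peak = 0.0
        for alloc, free in alloc_events:
            current += alloc * (1 + random.uniform(-jitter, jitter))
            peak = max(peak, current)
            current -= free
        peaks.append(peak)
    peaks.sort()
    return peaks[int(quantile * (trials - 1))]
```

With `jitter=0` this degenerates to a deterministic replay of the trace; increasing the jitter widens the peak distribution and pushes the quantile estimate upward, trading memory headroom for OOM safety.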

📝 Abstract
The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis or historical data with machine learning often fail to accurately capture runtime dynamics. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. Thus, the key challenge lies in precisely estimating dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements a priori. We conducted a thorough evaluation of xMem against state-of-the-art solutions using workloads from 25 different models, including architectures like Convolutional Neural Networks and Transformers. The analysis of 5,209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits: it decreases the median relative error by 91% and reduces the probability of estimation failure under safe OOM thresholds by 75%, meaning that the estimated value can often be used directly without causing OOM. Ultimately, these improvements lead to a 368% increase in memory conservation potential over current solutions.
Problem

Research questions and friction points this paper is trying to address.

Estimating GPU memory requirements for deep learning training workloads
Addressing limitations of static analysis and historical data methods
Predicting dynamic memory needs without consuming GPU resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

CPU-based dynamic analysis for GPU memory estimation
Estimates peak GPU memory without GPU resource consumption
Non-intrusive approach requiring no code modifications
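One allocator nuance the abstract alludes to can be sketched briefly: caching GPU allocators typically round each request up to an alignment granularity, so summing raw tensor sizes underestimates the real footprint. The sketch below assumes a 512-byte granularity (PyTorch's minimum block alignment); the function name and interface are illustrative, not part of xMem.

```python
def allocator_footprint(tensor_bytes, granularity=512):
    """Illustrative allocator model: round each allocation request up to
    the allocator's alignment granularity (512 B assumed here) and sum
    the rounded sizes. Raw tensor sizes alone would underestimate this."""
    rounded = [-(-b // granularity) * granularity for b in tensor_bytes]
    return sum(rounded)
```

For example, three tensors of 1 B, 512 B, and 513 B occupy 512 + 512 + 1024 = 2048 B under this model, not the 1026 B their raw sizes suggest; a CPU-side estimator must account for this rounding to match the GPU's actual peak.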