Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

📅 2025-09-25
📈 Citations: 0 · Influential: 0
🤖 AI Summary
In modern GPU inference, cache efficiency is a central bottleneck: recommendation systems are limited by low embedding-table hit rates, and LLM serving by KV-cache misses. Traditional heuristics such as LRU fail to adapt to structured access patterns, while existing learning-based caching approaches suffer either from poor robustness, degrading sharply on prediction errors, or from conservative designs that yield limited gains at high overhead. Method: We propose LCR, a learning-based caching framework centered on the LARU algorithm, which dynamically fuses learned predictions with LRU via online error estimation. LARU approaches optimal performance when predictions are accurate and degrades gracefully to the LRU baseline under misprediction, balancing efficiency and robustness. Contribution/Results: Evaluated on DLRM and LLM workloads, LCR achieves up to 24.2% higher throughput and up to 28.3% lower P99 time-to-first-token (TTFT), while maintaining stable performance under prediction failure.
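
To make the fusion idea concrete, here is a minimal, hypothetical sketch of a LARU-style eviction policy in Python. It assumes a `predictor(key, now)` callable that estimates each cached item's next access time; the class name `LARUCache`, the rank-blending rule, and the `trust` weight are illustrative assumptions, not the paper's exact algorithm.

```python
from collections import OrderedDict

class LARUCache:
    """Illustrative LRU-plus-predictions cache (not the paper's exact LARU)."""

    def __init__(self, capacity, predictor, trust=0.5):
        self.capacity = capacity
        self.predictor = predictor   # predictor(key, now) -> predicted next-access time
        self.trust = trust           # in [0, 1]; 0 recovers plain LRU
        self.items = OrderedDict()   # insertion order tracks recency (oldest first)

    def access(self, key, now):
        """Touch `key` at time `now`, evicting if the cache is full. Returns hit/miss."""
        if key in self.items:
            self.items.move_to_end(key)          # standard LRU bookkeeping on a hit
            return True
        if len(self.items) >= self.capacity:
            self._evict(now)
        self.items[key] = True
        return False

    def _evict(self, now):
        keys = list(self.items)                              # oldest-first order
        lru_rank = {k: i for i, k in enumerate(keys)}        # rank 0 = LRU victim
        by_pred = sorted(keys, key=lambda k: self.predictor(k, now), reverse=True)
        pred_rank = {k: i for i, k in enumerate(by_pred)}    # rank 0 = farthest reuse
        # Fuse the two orderings: trust -> 1 approaches Belady-style eviction
        # (drop the item predicted to be reused farthest in the future);
        # trust -> 0 reduces exactly to plain LRU.
        victim = min(keys, key=lambda k: self.trust * pred_rank[k]
                                       + (1.0 - self.trust) * lru_rank[k])
        del self.items[victim]
```

With `trust = 0` the blended score is pure recency rank, so the policy is exactly LRU; with `trust = 1` it evicts the item whose predicted reuse is farthest away, which is the Belady-optimal choice when predictions are exact.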

📝 Abstract
In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
Problem

Research questions and friction points this paper is trying to address.

Improving GPU cache efficiency for modern inference workloads
Addressing limitations of heuristic and learning-based caching policies
Ensuring robustness when ML predictions are inaccurate
Innovation

Methods, ideas, or system contributions that make the work stand out.

LARU enhances LRU with machine-learned predictions of future accesses
Adapts to prediction accuracy online via error estimation (see the sketch after this list)
Degrades gracefully to near-LRU performance when predictions are inaccurate
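
The online adaptation bullet can be read as an error estimator that converts observed prediction quality into the `trust` weight used in the cache sketch above. The EMA form, the normalization by a time `horizon`, and the constant `alpha` below are assumptions for illustration; the paper defines its own estimator.

```python
class OnlineErrorEstimator:
    """Hypothetical adapter: tracks normalized prediction error with an EMA."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha      # EMA smoothing factor (assumed constant)
        self.err = 0.0          # running normalized error in [0, 1]
        self.pending = {}       # key -> next-access time predicted at last access

    def record_prediction(self, key, predicted_next):
        """Remember what the model predicted for `key` so the reuse can be scored."""
        self.pending[key] = predicted_next

    def observe_access(self, key, now, horizon):
        """On a real reuse of `key`, score the earlier prediction against `now`."""
        predicted = self.pending.pop(key, None)
        if predicted is None:
            return
        e = min(abs(predicted - now) / horizon, 1.0)   # clamp relative error to [0, 1]
        self.err = (1.0 - self.alpha) * self.err + self.alpha * e

    @property
    def trust(self):
        # Accurate predictions drive trust toward 1 (near-optimal regime);
        # persistent errors drive it toward 0, a graceful fall-back to LRU.
        return 1.0 - self.err
```

Coupling the two sketches is then one assignment per access, e.g. `cache.trust = estimator.trust`, which is what lets prediction accuracy translate into Belady-like evictions while sustained errors pull behavior back to LRU.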
Authors

Peng Chen, Zhejiang University
Jiaji Zhang, Zhejiang University
Hailiang Zhao, ZJU 100 Young Professor, Zhejiang University (Service Computing · Edge Computing · Learning-Augmented Algorithms)
Yirong Zhang, Zhejiang University
Jiahong Yu, Zhejiang University
Xueyan Tang, Industry Professor of Cryptography at Suzhou Institute of AI, SJTU (Blockchain Security · Cryptography · Decision-making Science)
Yixuan Wang, Nanjing University of Aeronautics and Astronautics
Hao Li, Kuaishou
Jianping Zou, Kuaishou
Gang Xiong, Kuaishou
Kingsum Chow, Zhejiang University
Shuibing He, Professor, Zhejiang University (Intelligent Computing · Storage Systems · Processing-in-Memory · Computer Architecture)
Shuiguang Deng, Zhejiang University