Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Efficient deployment of large language models requires evaluating numerous serving configurations, yet conducting such evaluations on real GPU clusters is prohibitively expensive. To address this challenge, this work proposes a transparent simulation methodology that requires no modification to the serving codebase. By intercepting CUDA calls, virtualizing GPU devices, and incorporating kernel execution time prediction, the approach leverages time-warping simulation to fast-forward virtual time in GPU-free environments. A cross-process time-jump coordination protocol ensures causal consistency without necessitating rewrites of control logic. Experimental results on vLLM and SGLang demonstrate that the method achieves performance prediction errors below 5% across diverse models and parallelism configurations, while accelerating simulation by 5–17× compared to real GPU execution.
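The core time-warp idea described above can be illustrated with a minimal sketch: instead of executing a GPU kernel, the emulator advances a virtual clock by the kernel's predicted duration. The class and method names below (`VirtualClock`, `launch_kernel`) are hypothetical illustrations, not Revati's actual API.

```python
class VirtualClock:
    """Illustrative sketch (not Revati's real interface): time-warp
    fast-forwarding replaces kernel execution with a time jump."""

    def __init__(self):
        self.now_us = 0  # virtual time in microseconds

    def launch_kernel(self, predicted_us):
        # A real kernel launch would be intercepted at the CUDA API layer;
        # here we only advance virtual time by the predicted duration.
        self.now_us += predicted_us
        return self.now_us


clock = VirtualClock()
clock.launch_kernel(1500)  # e.g. a predicted attention kernel
clock.launch_kernel(800)   # e.g. a predicted MLP kernel
assert clock.now_us == 2300
```

Because no kernel actually runs, wall-clock cost is just the bookkeeping above, which is how GPU-free simulation can outpace real execution.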

📝 Abstract
Deploying LLMs efficiently requires testing hundreds of serving configurations, but evaluating each one on a GPU cluster takes hours and costs thousands of dollars. Discrete-event simulators are faster and cheaper, but they require re-implementing the serving system's control logic -- a burden that compounds as frameworks evolve. We present Revati, a time-warp emulator that enables performance modeling by directly executing real serving system code at simulation-like speed. The system intercepts CUDA API calls to virtualize device management, allowing serving frameworks to run without physical GPUs. Instead of executing GPU kernels, it performs time jumps -- fast-forwarding virtual time by predicted kernel durations. We propose a coordination protocol that synchronizes these jumps across distributed processes while preserving causality. On vLLM and SGLang, Revati achieves less than 5% prediction error across multiple models and parallelism configurations, while running 5-17x faster than real GPU execution.
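The abstract's coordination protocol is not detailed in this listing, but the causality constraint it must enforce can be sketched: a process may fast-forward its virtual clock at most to the earliest pending event across all distributed processes, so it never jumps past an event that could causally affect it. The function `safe_jump_target` is a hypothetical illustration of that invariant, not the paper's protocol.

```python
def safe_jump_target(local_next_event, peer_next_events):
    """Illustrative only: cap a process's time jump at the minimum
    pending-event time across all processes, preserving causality."""
    return min([local_next_event] + list(peer_next_events))


# Process A wants to jump to t=500, but a peer has an event at t=320,
# so A may only advance to t=320:
assert safe_jump_target(500, [320, 900]) == 320
```

A real protocol must additionally exchange these pending-event times across processes (e.g. via collective communication), which is where the coordination cost lies.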
Problem

Research questions and friction points this paper is trying to address.

LLM serving
performance evaluation
GPU simulation
time-warp emulation
serving configuration
Innovation

Methods, ideas, or system contributions that make the work stand out.

time-warp emulation
GPU-free simulation
LLM serving
CUDA virtualization
distributed causality
Amey Agrawal
PhD Student at Georgia Tech
Systems for AI
Mayank Yadav
Georgia Institute of Technology
Sukrit Kumar
Georgia Institute of Technology
Anirudha Agrawal
Georgia Institute of Technology
Garv Ghai
Georgia Institute of Technology
Souradeep Bera
Georgia Institute of Technology
Elton Pinto
Georgia Institute of Technology
Sirish Gambhira
Georgia Institute of Technology
Mohammad Adain
Georgia Institute of Technology
Kasra Sohrab
Georgia Institute of Technology
Chus Antonanzas
Georgia Institute of Technology
Alexey Tumanov
Associate Professor, Georgia Institute of Technology
Systems for ML, soft real-time ML, LLM inference