Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Efficient deployment of large language models requires evaluating numerous serving configurations, yet conducting such evaluations on real GPU clusters is prohibitively expensive. To address this challenge, this work proposes a transparent simulation methodology that requires no modification to the serving codebase. By intercepting CUDA calls, virtualizing GPU devices, and incorporating kernel execution time prediction, the approach leverages time-warping simulation to fast-forward virtual time in GPU-free environments. A cross-process time-jump coordination protocol ensures causal consistency without necessitating rewrites of control logic. Experimental results on vLLM and SGLang demonstrate that the method achieves performance prediction errors below 5% across diverse models and parallelism configurations, while accelerating simulation by 5–17× compared to real GPU execution.
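The core time-warp idea described above can be illustrated with a minimal sketch: instead of executing a GPU kernel, the emulator advances a virtual clock by the kernel's predicted duration. The class and method names below (`VirtualClock`, `launch_kernel`) are hypothetical illustrations, not Revati's actual API.

```python
class VirtualClock:
    """Illustrative sketch (not Revati's real interface): time-warp
    fast-forwarding replaces kernel execution with a time jump."""

    def __init__(self):
        self.now_us = 0  # virtual time in microseconds

    def launch_kernel(self, predicted_us):
        # A real kernel launch would be intercepted at the CUDA API layer;
        # here we only advance virtual time by the predicted duration.
        self.now_us += predicted_us
        return self.now_us


clock = VirtualClock()
clock.launch_kernel(1500)  # e.g. a predicted attention kernel
clock.launch_kernel(800)   # e.g. a predicted MLP kernel
assert clock.now_us == 2300
```

Because no kernel actually runs, wall-clock cost is just the bookkeeping above, which is how GPU-free simulation can outpace real execution.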

📝 Abstract
Deploying LLMs efficiently requires testing hundreds of serving configurations, but evaluating each one on a GPU cluster takes hours and costs thousands of dollars. Discrete-event simulators are faster and cheaper, but they require re-implementing the serving system's control logic -- a burden that compounds as frameworks evolve. We present Revati, a time-warp emulator that enables performance modeling by directly executing real serving system code at simulation-like speed. The system intercepts CUDA API calls to virtualize device management, allowing serving frameworks to run without physical GPUs. Instead of executing GPU kernels, it performs time jumps -- fast-forwarding virtual time by predicted kernel durations. We propose a coordination protocol that synchronizes these jumps across distributed processes while preserving causality. On vLLM and SGLang, Revati achieves less than 5% prediction error across multiple models and parallelism configurations, while running 5-17x faster than real GPU execution.
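The abstract's coordination protocol is not detailed in this listing, but the causality constraint it must enforce can be sketched: a process may fast-forward its virtual clock at most to the earliest pending event across all distributed processes, so it never jumps past an event that could causally affect it. The function `safe_jump_target` is a hypothetical illustration of that invariant, not the paper's protocol.

```python
def safe_jump_target(local_next_event, peer_next_events):
    """Illustrative only: cap a process's time jump at the minimum
    pending-event time across all processes, preserving causality."""
    return min([local_next_event] + list(peer_next_events))


# Process A wants to jump to t=500, but a peer has an event at t=320,
# so A may only advance to t=320:
assert safe_jump_target(500, [320, 900]) == 320
```

A real protocol must additionally exchange these pending-event times across processes (e.g. via collective communication), which is where the coordination cost lies.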
Problem

Research questions and friction points this paper is trying to address.

LLM serving
performance evaluation
GPU simulation
time-warp emulation
serving configuration
Innovation

Methods, ideas, or system contributions that make the work stand out.

time-warp emulation
GPU-free simulation
LLM serving
CUDA virtualization
distributed causality
Amey Agrawal
PhD Student at Georgia Tech
Systems for AI
Mayank Yadav
Georgia Institute of Technology
Sukrit Kumar
Georgia Institute of Technology
Anirudha Agrawal
Georgia Institute of Technology
Garv Ghai
Georgia Institute of Technology
Souradeep Bera
Georgia Institute of Technology
Elton Pinto
Georgia Institute of Technology
Sirish Gambhira
Georgia Institute of Technology
Mohammad Adain
Georgia Institute of Technology
Kasra Sohrab
Georgia Institute of Technology
Chus Antonanzas
Georgia Institute of Technology
Alexey Tumanov
Associate Professor, Georgia Institute of Technology
Systems for ML, soft real-time ML, LLM inference