🤖 AI Summary
To address the surging energy consumption of interactive LLM inference services (e.g., chat assistants, code generation), this paper proposes an SLO-aware, energy-efficient serving architecture. Methodologically, it introduces the first control-theoretic GPU frequency scaling mechanism, coupled with a state-space-model-based request routing strategy, enabling phase-level energy-efficiency co-optimization in prefill/decode-disaggregated execution. Deeply integrated into the SGLang system, the design supports multi-model deployment and real-world workloads. Experiments under stringent latency SLO constraints demonstrate up to 36.3% energy reduction while maintaining ≥99.8% SLO compliance. The core contribution is the novel integration of feedback control and state-space routing for LLM inference energy optimization, achieving fine-grained, accurate, low-overhead, and SLO-guaranteed power regulation.
📝 Abstract
Modern Large Language Model (LLM) serving systems increasingly support interactive applications such as real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained, phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for the prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance on multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.
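To make the feedback-control idea concrete, here is a minimal sketch of an SLO-driven frequency loop. This is not the paper's controller: the function name, gain, and frequency bounds are hypothetical, and it uses a plain proportional rule where VoltanaLLM's design is more sophisticated. It only illustrates the principle of steering GPU frequency toward the lowest setting that still meets a latency SLO.

```python
def next_frequency(freq_mhz, observed_latency_ms, slo_ms,
                   f_min=600.0, f_max=1980.0, gain=2.0):
    """Proportional feedback step (illustrative, hypothetical API).

    Positive error (latency above SLO) raises frequency; negative
    error (latency headroom) lowers it to save energy. The result
    is clamped to the GPU's supported frequency range.
    """
    error = observed_latency_ms - slo_ms       # >0 means SLO is violated
    adjusted = freq_mhz + gain * error         # speed up when slow, slow down when fast
    return max(f_min, min(f_max, adjusted))

# Example: 60 ms of latency headroom lets the controller lower the clock.
f = next_frequency(1500.0, observed_latency_ms=40.0, slo_ms=100.0)
```

In a real serving loop, the new frequency would be applied per phase (prefill vs. decode instances) each control interval, which is what the disaggregated architecture makes possible.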