VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the surging energy consumption of interactive LLM inference services (e.g., chat assistants, code generation), this paper proposes an SLO-aware, energy-efficient serving architecture. It introduces the first control-theoretic GPU frequency scaling mechanism, coupled with a state-space-model-based request routing strategy, enabling phase-level energy-efficiency co-optimization within prefill/decode disaggregated execution. Implemented in the SGLang system, the design is evaluated on multiple state-of-the-art LLMs and real-world workloads. Under stringent latency SLO constraints, experiments demonstrate up to 36.3% energy reduction while maintaining ≥99.8% SLO attainment. The core contribution is the novel integration of feedback control and state-space routing for LLM inference energy optimization, achieving fine-grained, accurate, low-overhead, and SLO-guaranteed power regulation.

📝 Abstract
Modern Large Language Model (LLM) serving systems increasingly support interactive applications, like real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance over multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.
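The abstract's feedback-driven frequency controller can be sketched as a simple latency feedback loop. The paper's actual control law is not given here, so everything below is an assumption for illustration: hypothetical DVFS frequency steps, a per-phase latency SLO (e.g., TTFT for prefill or inter-token latency for decode), and a step-down/reset-up policy that chases the lowest frequency still meeting the SLO.

```python
# Minimal sketch of a feedback-driven GPU frequency controller (illustrative;
# frequency table, headroom factor, and control policy are assumptions).

FREQ_STEPS_MHZ = [810, 990, 1110, 1230, 1350, 1410]  # hypothetical DVFS states

class FeedbackFrequencyController:
    def __init__(self, slo_ms: float, headroom: float = 0.9):
        self.slo_ms = slo_ms            # latency SLO for this phase
        self.headroom = headroom        # target latency as a fraction of the SLO
        self.idx = len(FREQ_STEPS_MHZ) - 1  # start at max frequency (safe)

    @property
    def freq_mhz(self) -> int:
        return FREQ_STEPS_MHZ[self.idx]

    def update(self, measured_ms: float) -> int:
        """One feedback iteration: adjust frequency from the latest latency sample."""
        if measured_ms > self.slo_ms:
            # SLO violated: jump back to max frequency to recover quickly
            self.idx = len(FREQ_STEPS_MHZ) - 1
        elif measured_ms < self.headroom * self.slo_ms and self.idx > 0:
            # Comfortable slack: step down one level to save energy
            self.idx -= 1
        return self.freq_mhz
```

Running the controller per phase (prefill vs. decode) is what the disaggregated architecture enables: each phase gets its own SLO and its own frequency trajectory instead of one compromise setting.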
Problem

Research questions and friction points this paper is trying to address.

Reducing energy costs in LLM serving systems
Optimizing frequency scaling and request routing
Maintaining latency constraints while saving energy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback-driven GPU frequency control for phases
State-space routing across scaled instances
Prefill/decode disaggregated architecture co-design
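The state-space routing idea in the list above can be sketched as a small search: summarize each frequency-scaled instance by a state, predict latency and energy for each candidate, and pick the cheapest SLO-feasible one. The state variables and the latency/energy models below are toy assumptions, not the paper's formulation.

```python
# Minimal sketch of state-space routing across frequency-scaled instances
# (state definition and prediction models are illustrative assumptions).
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstanceState:
    name: str
    queue_len: int   # requests already queued on this instance
    freq_mhz: int    # instance's current GPU frequency

def predict_latency_ms(s: InstanceState) -> float:
    # Toy model: latency grows with queue depth, shrinks with frequency.
    return (s.queue_len + 1) * 1e5 / s.freq_mhz

def predict_energy(s: InstanceState) -> float:
    # Toy model: dynamic power rises roughly quadratically with frequency.
    return (s.freq_mhz / 1000.0) ** 2

def route(states: list[InstanceState], slo_ms: float) -> Optional[InstanceState]:
    """Pick the SLO-feasible instance with the lowest predicted energy."""
    feasible = [s for s in states if predict_latency_ms(s) <= slo_ms]
    if not feasible:
        return None  # no instance can meet the SLO at its current frequency
    return min(feasible, key=predict_energy)
```

The point of the search is that the energy-optimal choice is often not the fastest instance: a lower-frequency instance with enough latency slack wins whenever it still clears the SLO.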