Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

📅 2025-08-27
🤖 AI Summary
To address resource imbalance, inefficient use of heterogeneous GPUs, and network bottlenecks in Prefill-Decode (P/D) disaggregated LLM inference architectures, this paper proposes HeteroScale, a metric-driven, coordinated autoscaling framework. It uses a single robust metric to scale the prefill and decode resource pools jointly, and combines that policy with a topology-aware scheduler. The policy is derived from what the authors describe as the first large-scale empirical study of autoscaling signals in production. Deployed in production on tens of thousands of GPUs, the framework raises average GPU utilization by 26.6 percentage points and saves hundreds of thousands of GPU-hours daily while upholding stringent service-level objectives (SLOs).

📝 Abstract
Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.

Problem

Research questions and friction points this paper is trying to address.

Autoscaling for disaggregated LLM inference architectures
Managing heterogeneous hardware and network bottlenecks
Balancing prefill and decode stages efficiently

Innovation

Methods, ideas, or system contributions that make the work stand out.

Coordinated autoscaling framework for disaggregated LLM inference
Topology-aware scheduler adapting to heterogeneous hardware constraints
Novel metric-driven policy for joint prefill-decode scaling
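
The joint prefill-decode scaling idea can be illustrated with a minimal sketch. The page does not say which metric HeteroScale uses or how its policy is computed, so everything below is an assumption made for illustration: the names `PoolSizes` and `coordinated_scale`, the per-instance metric, the fixed prefill-to-decode ratio `pd_ratio`, and the proportional rule itself (the same target-tracking form used by common autoscalers such as the Kubernetes Horizontal Pod Autoscaler). It is not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSizes:
    """Instance counts of the two pools (hypothetical type)."""
    prefill: int
    decode: int


def coordinated_scale(
    current: PoolSizes,
    observed_metric: float,   # assumed: average per-instance value of the chosen metric
    target_metric: float,     # assumed: desired per-instance value of that metric
    pd_ratio: float,          # assumed: prefill instances per decode instance
    tolerance: float = 0.1,   # dead band that suppresses small oscillations
) -> PoolSizes:
    """Derive both pool sizes from one metric so the pools stay in proportion."""
    if target_metric <= 0 or pd_ratio <= 0:
        raise ValueError("target_metric and pd_ratio must be positive")

    load = observed_metric / target_metric
    if abs(load - 1.0) <= tolerance:
        return current  # close enough to the target: leave both pools alone

    # Proportional target tracking on the decode pool.
    decode = max(1, math.ceil(current.decode * load))
    # The prefill pool is never scaled on its own; it follows the decode
    # pool through the fixed ratio, which keeps the two stages balanced.
    prefill = max(1, math.ceil(decode * pd_ratio))
    return PoolSizes(prefill=prefill, decode=decode)


if __name__ == "__main__":
    pools = PoolSizes(prefill=4, decode=8)
    print(coordinated_scale(pools, observed_metric=150.0, target_metric=100.0, pd_ratio=0.5))
```

Because only one signal drives the decision, the two pools cannot drift apart the way they could if each were scaled by its own independent autoscaler. A production policy would also need cooldowns, per-hardware ratios, and placement constraints, none of which this sketch attempts.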

Rongzhi Li, ByteDance Seed
Ruogu Du, ByteDance Seed
Zefang Chu, ByteDance Seed
Sida Zhao, ByteDance Seed
Chunlei Han, ByteDance Seed
Zuocheng Shi, ByteDance Seed
Yiwen Shao, Johns Hopkins University (speech recognition, machine learning, deep learning, Natural Language Processing)
Huanle Han, ByteDance Seed
Long Huang, Xi'an Jiaotong-Liverpool University (Heat Transfer, Modeling, Optimization, HVAC&R)
Zherui Liu, ByteDance Seed
Shufan Liu, ByteDance Seed