OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency

📅 2025-11-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the system-level challenges in large language model (LLM) serving, namely compute intensity, latency sensitivity, and throughput bottlenecks, this paper introduces OmniInfer, a unified end-to-end acceleration framework. OmniInfer integrates three complementary techniques: load-aware Mixture-of-Experts expert placement (OmniPlacement), sparse attention acceleration (OmniAttn), and disaggregation-aware request scheduling (OmniProxy), enabling holistic optimization across both the prefill and decode phases. Built atop vLLM, it also supports adaptive resource disaggregation and fine-grained cache compression. Evaluated on DeepSeek-R1 in a 10-node Ascend 910C cluster, OmniInfer achieves 616 queries per minute (QPM), reduces time-per-output-token (TPOT) by 36%, and cuts time-to-first-token (TTFT) by 38%. These results demonstrate substantial improvements in LLM serving efficiency and scalability.
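The summary names load-aware expert placement but not its algorithm. Purely as a hypothetical illustration of the general idea (the function and inputs below are assumptions, not OmniPlacement's API), one common approach is a longest-processing-time greedy: sort experts by observed routing load and repeatedly assign the heaviest remaining expert to the currently lightest device.

```python
# Hypothetical sketch of load-aware expert placement (not the paper's
# actual OmniPlacement algorithm): balance per-expert routing load
# across devices with a greedy LPT heuristic.
import heapq

def place_experts(expert_load: dict[int, int], num_devices: int) -> dict[int, list[int]]:
    """Return a device-id -> expert-ids mapping that balances total load."""
    heap = [(0, d) for d in range(num_devices)]   # (accumulated load, device)
    heapq.heapify(heap)
    placement: dict[int, list[int]] = {d: [] for d in range(num_devices)}
    # Place the heaviest experts first so they spread across devices.
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        dev_load, dev = heapq.heappop(heap)
        placement[dev].append(expert)
        heapq.heappush(heap, (dev_load + load, dev))
    return placement

if __name__ == "__main__":
    # Example: eight experts with skewed routing load, four devices.
    loads = {0: 900, 1: 120, 2: 450, 3: 80, 4: 610, 5: 300, 6: 70, 7: 50}
    print(place_experts(loads, num_devices=4))
```

Re-running such a placement periodically against fresh load counters is one way a serving system can react to routing skew; whether OmniPlacement works this way is not stated on this page.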

📝 Abstract
Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across prefill and decode phases. Evaluated on DeepSeek-R1 within a 10-node Ascend 910C cluster, OmniInfer achieves 616 QPM; the unified framework reduces TPOT by 36%, and layering OmniProxy on top further reduces TTFT by 38%. The project is open-sourced at https://gitee.com/omniai/omniinfer.
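The abstract positions OmniProxy as disaggregation-aware scheduling over separated prefill and decode resources, but does not spell out the policy. The sketch below is therefore only a plausible reading (all class, function, and field names are invented for illustration): route a new request to a prefill worker that likely holds its prompt prefix in cache, falling back to the least-loaded one, then hand the finished prefill to the least-loaded decode worker.

```python
# Hypothetical sketch of cache- and load-aware routing across
# disaggregated prefill/decode pools; not OmniProxy's actual policy.
import hashlib
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active: int = 0                                        # in-flight requests
    cached_prefixes: set[str] = field(default_factory=set)

def prefix_key(prompt: str, block: int = 64) -> str:
    """Hash of the leading prompt block, standing in for a KV-cache prefix id."""
    return hashlib.sha256(prompt[:block].encode()).hexdigest()

def route_prefill(workers: list[Worker], prompt: str) -> Worker:
    key = prefix_key(prompt)
    hits = [w for w in workers if key in w.cached_prefixes]
    target = min(hits or workers, key=lambda w: w.active)  # affinity first, then load
    target.cached_prefixes.add(key)
    target.active += 1
    return target

def route_decode(workers: list[Worker]) -> Worker:
    target = min(workers, key=lambda w: w.active)          # pure least-load for decode
    target.active += 1
    return target
```

Splitting the two routing decisions mirrors the prefill/decode disaggregation the abstract describes: TTFT is largely governed by prefill placement (hence the cache-affinity term), while TPOT is governed by decode-side load.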
Problem

Research questions and friction points this paper is trying to address.

How to optimize LLM serving throughput and latency system-wide
How to overcome compute intensity and throughput bottlenecks in large-scale serving
How to raise serving efficiency via expert placement, cache compression, and scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified system-level acceleration framework for LLM serving
Integrates load-aware MoE scheduling and sparse attention acceleration (see the sketch after this list)
Uses adaptive resource disaggregation and global coordination
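For the sparse-attention component, the page gives only the name OmniAttn. A generic top-k formulation (one of several sparse-attention variants, chosen here for brevity; the function and parameter names are assumptions) keeps, for each query, only the k highest-scoring keys instead of attending over the full KV cache:

```python
# Hypothetical top-k sparse attention sketch; OmniAttn's actual kernel
# design is not described on this page.
import numpy as np

def topk_sparse_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, topk: int = 32) -> np.ndarray:
    """q: (Tq, d); k, v: (Tk, d). Each query attends to its top-`topk` keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (Tq, Tk) scaled dot products
    keep = np.argsort(scores, axis=-1)[:, -topk:]      # indices of the top-k keys
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, keep, 0.0, axis=-1)        # 0 for kept keys, -inf otherwise
    scores = scores + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over surviving keys
    return weights @ v
```

This reference version still computes every score for clarity; the point of an optimized kernel is to skip the score and softmax work for dropped keys entirely, shrinking decode-time attention cost as the KV cache grows.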
Jun Wang
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Yunxiang Yao
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Wenwei Kuang
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Runze Mao
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Zhenhao Sun
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Zhuang Tao
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Ziyang Zhang
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Dengyu Li
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Jiajun Chen
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Zhili Wang
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Kai Cui
Technische Universität Darmstadt
Mean Field Games · Reinforcement Learning · LLM Inference
Congzhi Cai
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Longwen Lan
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
Ken Zhang
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.