🤖 AI Summary
To address the system-level challenges in large language model (LLM) serving, namely compute intensity, latency sensitivity, and throughput bottlenecks, this paper introduces OmniInfer, a unified end-to-end acceleration framework. OmniInfer integrates three complementary techniques: load-aware Mixture-of-Experts expert placement (OmniPlacement), sparse attention acceleration (OmniAttn), and disaggregation-aware request scheduling (OmniProxy), enabling holistic optimization across both the prefill and decode phases. Built atop vLLM, it supports adaptive resource disaggregation and fine-grained cache compression. Evaluated with DeepSeek-R1 on a 10-node Ascend 910C cluster, OmniInfer achieves 616 queries per minute (QPM), reduces time-per-output-token (TPOT) by 36%, and cuts time-to-first-token (TTFT) by 38%. These results demonstrate substantial improvements in LLM serving efficiency and scalability.
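The abstract does not detail how OmniPlacement assigns experts to devices. As an illustration of the general idea behind load-aware expert placement, the sketch below uses a simple longest-processing-time greedy heuristic: experts are sorted by observed routing load and each is assigned to the currently least-loaded device. All names here (`place_experts`, the example loads) are hypothetical and not taken from the paper.

```python
import heapq

def place_experts(expert_loads, num_devices):
    """Greedy load-aware placement sketch (not OmniPlacement's actual
    algorithm): assign each expert, heaviest first, to the device with
    the smallest accumulated load.

    expert_loads: dict mapping expert id -> observed routing load.
    Returns a dict mapping device id -> list of expert ids.
    """
    # Min-heap of (accumulated_load, device_id) so the least-loaded
    # device is always at the top.
    heap = [(0.0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, dev = heapq.heappop(heap)
        placement[dev].append(expert)
        heapq.heappush(heap, (total + load, dev))
    return placement

# Hypothetical per-expert loads for a 6-expert layer on 2 devices.
loads = {0: 9.0, 1: 7.0, 2: 6.0, 3: 5.0, 4: 4.0, 5: 2.0}
print(place_experts(loads, 2))  # → {0: [0, 3, 5], 1: [1, 2, 4]}
```

A real system would additionally re-run placement periodically as routing statistics drift, and account for replication of hot experts; this sketch only shows the balancing step.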
📝 Abstract
Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across prefill and decode phases. Evaluated on DeepSeek-R1 within a 10-node Ascend 910C cluster, OmniInfer achieves 616 QPM, with the unified framework reducing TPOT by 36% and OmniProxy further reducing TTFT by 38%. The project is open-sourced at [https://gitee.com/omniai/omniinfer](https://gitee.com/omniai/omniinfer).
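Prefill/decode disaggregation, which OmniProxy's scheduling targets, runs the two phases on separate node pools so that compute-bound prefill does not interfere with latency-sensitive decode. The sketch below shows the routing idea only: each phase of a request is sent to the least-loaded node in its pool. The class and method names (`DisaggProxy`, `route_prefill`, `route_decode`) are illustrative assumptions, not OmniProxy's actual API.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    active: int = 0  # requests currently assigned to this node

class DisaggProxy:
    """Sketch of a disaggregation-aware proxy (not OmniProxy itself):
    prefill and decode run on separate node pools, and each phase is
    routed to that pool's least-loaded node."""

    def __init__(self, prefill_nodes, decode_nodes):
        self.prefill = [Node(n) for n in prefill_nodes]
        self.decode = [Node(n) for n in decode_nodes]

    def _pick(self, pool):
        # Least-loaded node wins; ties go to the first node in the pool.
        node = min(pool, key=lambda n: n.active)
        node.active += 1
        return node.name

    def route_prefill(self, request_id):
        return self._pick(self.prefill)

    def route_decode(self, request_id):
        return self._pick(self.decode)

proxy = DisaggProxy(["p0", "p1"], ["d0", "d1"])
print(proxy.route_prefill("req-1"))  # → p0
print(proxy.route_prefill("req-2"))  # → p1
print(proxy.route_decode("req-1"))   # → d0
```

A production scheduler would also weigh KV-cache locality and transfer cost when handing a request from a prefill node to a decode node; active-request count stands in for all of that here.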