FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

📅 2026-06-02

🤖 AI Summary
Current NPU deployments expose physical devices directly to applications and therefore cannot dynamically schedule between the compute-intensive prefill phase and the memory-bandwidth-constrained decode phase of large language model (LLM) inference, which leaves resource utilization imbalanced. This work proposes FlexNPU, a transparent user-space NPU virtualization layer that intercepts AscendCL API calls and routes them through per-device daemons, decoupling applications from physical hardware without modifying models, frameworks, or drivers. On top of this layer, it adds a phase-aware dynamic prefill-decode (PD) co-location mechanism that exploits the complementary resource demands of the two phases to adapt scheduling at runtime. On a 384-card Ascend 910C deployment of DeepSeek-R1, FlexNPU improves throughput over static PD disaggregation by 5.15% and 26.33%. On Qwen2.5-7B, compared with static PD co-location, it reduces time-to-first-token (TTFT) by over 92% with nearly unchanged time-per-output-token (TPOT) and comparable throughput, and it adds no measurable inference overhead relative to direct NPU passthrough.
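To make the idea of complementary resource demands concrete, here is a minimal Python sketch of a greedy admission step that packs prefill and decode tasks onto one device while both a compute budget and a bandwidth budget hold. It is not FlexNPU's scheduler: the `Task` type, the `co_schedule` function, the percentage budgets, and the demo numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    phase: str       # "prefill" or "decode"
    compute: int     # hypothetical share of device compute, in percent
    bandwidth: int   # hypothetical share of memory bandwidth, in percent


def co_schedule(tasks, compute_budget=100, bandwidth_budget=100):
    """Greedily admit tasks, in the given order, while both budgets hold."""
    admitted = []
    compute_left, bandwidth_left = compute_budget, bandwidth_budget
    for task in tasks:
        if task.compute <= compute_left and task.bandwidth <= bandwidth_left:
            admitted.append(task)
            compute_left -= task.compute
            bandwidth_left -= task.bandwidth
    return admitted


# Illustrative numbers only: prefill is compute-heavy, decode is bandwidth-heavy.
demo_tasks = [
    Task("A", "prefill", compute=70, bandwidth=20),
    Task("B", "prefill", compute=60, bandwidth=15),
    Task("C", "decode", compute=10, bandwidth=40),
    Task("D", "decode", compute=10, bandwidth=35),
]

admitted_names = [task.name for task in co_schedule(demo_tasks)]
print(admitted_names)
```

With these numbers the two prefill tasks cannot share the device because together they exceed the compute budget, while one prefill task and both decode tasks fit side by side. A real scheduler would also need to react to changing load, which this one-shot sketch does not model.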
📝 Abstract
Modern AI serving increasingly relies on NPUs for conventional inference and large language model serving. However, current NPU deployments commonly expose physical devices directly to applications, which limits runtime control over scheduling and makes it difficult to adapt execution to phase-level workload behavior. This limitation is particularly evident in LLM serving, where the prefill phase is compute-intensive while the decode phase is often constrained by memory bandwidth and KV-cache accesses. Static prefill-decode (PD) disaggregation reduces phase interference, but can introduce resource imbalance and unnecessary data movement. We present FlexNPU, a transparent user-space virtualization layer for Ascend NPUs. FlexNPU interposes on AscendCL APIs and routes NPU operations through per-device daemons, decoupling unmodified applications from physical NPU devices without modifying model code, AI frameworks, or NPU drivers. This runtime boundary allows FlexNPU to virtualize NPU objects, control operator dispatch, and support phase-aware scheduling for LLM serving. In particular, FlexNPU enables dynamic PD co-location, which adapts scheduling between prefill and decode according to their complementary resource characteristics. We implement FlexNPU on Huawei Ascend NPUs and evaluate it with typical LLM workloads. Compared with direct NPU passthrough, FlexNPU introduces no measurable inference overhead and slightly improves throughput in some scenarios. On a 384-card Ascend 910C deployment of DeepSeek-R1, FlexNPU improves throughput over static PD disaggregation by 5.15% and 26.33%. On Qwen2.5-7B, compared with static PD co-location, FlexNPU maintains comparable throughput while reducing TTFT by over 92% across tested workloads with nearly unchanged TPOT. These results show that transparent NPU virtualization is a practical substrate for efficient and responsive LLM serving.
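The interposition pattern described in the abstract can be illustrated with a toy Python model: a client-side shim hands the application virtual device ids and opaque handles, and forwards each operation to the daemon that owns the physical device. This is a sketch under stated assumptions, not FlexNPU's implementation. The `DeviceDaemon` and `VirtualRuntime` classes, the `alloc` and `launch` operations, and the fake addresses are hypothetical stand-ins and do not correspond to real AscendCL calls.

```python
import itertools


class DeviceDaemon:
    """Hypothetical stand-in for a per-device daemon owning one physical device."""

    def __init__(self, physical_id):
        self.physical_id = physical_id
        self.log = []  # operations in the order the daemon dispatched them
        self._next_ptr = itertools.count(0x1000, 0x100)  # fake device addresses

    def handle(self, op, *args):
        # All operations pass through here, so the daemon sees and orders them.
        self.log.append((op, args))
        if op == "alloc":
            return next(self._next_ptr)
        return None


class VirtualRuntime:
    """Hypothetical client-side shim: the application sees virtual ids and handles only."""

    def __init__(self, mapping):
        self._mapping = mapping  # virtual device id -> DeviceDaemon
        self._handles = {}       # virtual handle -> (daemon, physical pointer)
        self._next_handle = itertools.count(1)

    def alloc(self, virtual_dev, nbytes):
        daemon = self._mapping[virtual_dev]
        physical_ptr = daemon.handle("alloc", nbytes)
        handle = next(self._next_handle)
        self._handles[handle] = (daemon, physical_ptr)
        return handle  # the physical pointer never reaches the application

    def launch(self, kernel_name, handle):
        daemon, physical_ptr = self._handles[handle]
        daemon.handle("launch", kernel_name, physical_ptr)


# The application asks for virtual device 0; the shim maps it to physical device 3.
daemon = DeviceDaemon(physical_id=3)
runtime = VirtualRuntime({0: daemon})
buffer_handle = runtime.alloc(0, 256)
runtime.launch("matmul", buffer_handle)
print(daemon.log)
```

Because every operation crosses `DeviceDaemon.handle`, that single point is where a runtime layer could reorder or delay dispatch, which is the kind of control the abstract attributes to the runtime boundary.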
Problem

Research questions and friction points this paper is trying to address.

NPU virtualization
LLM serving
prefill-decode scheduling
resource imbalance
phase-aware execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

NPU virtualization
LLM serving
dynamic prefill-decode co-location
phase-aware scheduling
Ascend NPU
Jiongjiong Gu
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Jianfeng Wang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Zidong Han
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, China 518107
Yongqiao Wang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Pengfei Xia
Huawei Technologies
Deep Learning, Adversarial Examples, Backdoor Learning
Mingjie Zhang
MPhil Student, The Hong Kong University of Science and Technology (Guangzhou)
Robotics, Vision-Language Navigation
Hong Liu
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Yuanyi Xia
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Jiajia Chu
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Yifeng Tang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Hui Zang
UC Davis, Sprint, Guavus Inc., Google.
AI, networking
Xin Yao
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Qijie Qiu
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China 518060
Yuzhao Wang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Chuanfei Xu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, China 518107
Lin Zhang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Zhuonan Lai
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, China 518107
Hongming Huang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Jiawei Qiu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, China 518107
Gong Zhang
Huawei Technologies Co., Ltd, Shenzhen, China 518129
Zhong Ming
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, China 518107; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China 518060
Weipeng Cao
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, China 518107; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China 518060