SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding

📅 2026-01-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of attention inference and decoding efficiency for large language models on resource-constrained edge devices. The authors propose SwiftKV Attention, an algorithm that uses a per-token pipelined mechanism to compute attention in a single pass, eliminating score materialization and redundant memory accesses. They further design the SwiftKV-MHA accelerator, which unifies high-precision attention computation with low-precision GEMV operations in a single hardware architecture, enabling single-pass execution without block-wise softmax and one-time processing of the KV cache. The result is a low-latency attention mechanism that requires no additional parallelism overhead and substantially improves multi-head decoding efficiency on edge devices: SwiftKV Attention accelerates inference by 7.16× over native attention, while SwiftKV-MHA reduces attention latency by 13.48×, yielding a 17.4% increase in generation speed and a 1.98× improvement in token efficiency.

📝 Abstract
Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency, single-pass attention inference algorithm in which every (kt, vt) pair in the KV cache is processed exactly once in a uniform per-token pipeline, without score materialization, blockwise softmax, or a second pass, thereby enabling fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which enables high-precision attention and low-precision GEMV on the same processor array, achieving fast and efficient multi-head parallel decoding. Experimental results show that, on the edge accelerator, the SwiftKV Attention algorithm achieves a 7.16× speedup over native attention and significantly outperforms other attention algorithms. SwiftKV-MHA further reduces attention latency by 13.48×; under the same settings, it improves generation speed by 17.4% and increases token efficiency by 1.98× compared with state-of-the-art works.
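The single-pass idea described above, visiting each (kt, vt) in the KV cache exactly once while never materializing the full score vector, can be approximated with a running (online) softmax. The sketch below is an illustrative NumPy implementation of that general technique, not the paper's actual SwiftKV pipeline; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def single_pass_attention(q, kv_cache):
    """Illustrative single-pass decoding attention: each (k_t, v_t)
    is visited exactly once, maintaining a running softmax so the
    score vector is never materialized and no second pass is needed."""
    d = q.shape[0]
    m = -np.inf           # running maximum of scores (numerical stability)
    denom = 0.0           # running softmax denominator
    acc = np.zeros(d)     # running weighted sum of values
    for k_t, v_t in kv_cache:
        s = (q @ k_t) / np.sqrt(d)     # scalar score for this token
        m_new = max(m, s)
        scale = np.exp(m - m_new)      # rescale old state; 0.0 on first step
        w = np.exp(s - m_new)
        denom = denom * scale + w
        acc = acc * scale + w * v_t
        m = m_new
    return acc / denom
```

Because the loop keeps only three running quantities (max, denominator, accumulator), memory traffic per token is constant, which is the property that makes per-token pipelining on an edge accelerator attractive.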
Problem

Research questions and friction points this paper is trying to address.

edge acceleration
large language models
attention inference
multi-head decoding
KV cache
Innovation

Methods, ideas, or system contributions that make the work stand out.

edge acceleration
single-pass attention
KV cache pipelining
multi-head parallel decoding
heterogeneous precision processing
Authors
Junming Zhang, Qinyan Zhang, Huajun Sun, Feiyang Gao, Sheng Hu, Rui Nie, Xiangshui Miao
School of Integrated Circuits, Hubei Key Laboratory of Advanced Memories, Huazhong University of Science and Technology, Wuhan 430074, China