🤖 AI Summary
This work addresses the significant challenges of attention inference and decoding efficiency for large language models on resource-constrained edge devices. The authors propose SwiftKV Attention, an algorithm that leverages a per-token pipelined mechanism to compute attention in a single pass, eliminating score materialization and redundant memory accesses. They further design the SwiftKV-MHA accelerator, which unifies high-precision attention computation with low-precision GEMV operations within a single hardware architecture, enabling single-pass execution without block-wise softmax and one-time processing of the KV cache. Together, these techniques deliver low-latency attention without resource-intensive parallelism and substantially improve multi-head decoding efficiency on edge devices: SwiftKV Attention accelerates inference by 7.16× over native attention, while SwiftKV-MHA reduces attention latency by 13.48×, yielding a 17.4% increase in generation speed and a 1.98× improvement in token efficiency.
📝 Abstract
Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency, single-pass attention inference algorithm: every (k_t, v_t) pair in the KV cache is processed exactly once in a uniform per-token pipeline, without score materialization, blockwise softmax, or a second pass, enabling fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which runs high-precision attention and low-precision GEMV on the same processor array, achieving fast and efficient multi-head parallel decoding. Experimental results show that, on an edge accelerator, the SwiftKV Attention algorithm achieves a 7.16× speedup over native attention and significantly outperforms other attention algorithms. SwiftKV-MHA further reduces attention latency by 13.48×; under the same settings, it improves generation speed by 17.4% and increases token efficiency by 1.98× compared with state-of-the-art work.
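To make the single-pass idea concrete, the sketch below shows a generic one-pass attention decode for a single query vector using a running (max, sum, accumulator) update, so each (k_t, v_t) pair is read exactly once and no score vector is ever materialized. This is an illustrative reconstruction under stated assumptions, not the paper's SwiftKV pipeline or accelerator mapping; the function name and plain-Python style are ours.

```python
import math

def single_pass_attention(q, K, V):
    """One-pass attention decode for a single query vector q.

    Each (k_t, v_t) pair is visited exactly once; the softmax is folded
    into a running-max / running-sum update, so no score array is
    materialized and no second (normalization) pass over the KV cache
    is needed. Illustrative sketch only, not SwiftKV itself.
    """
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    m = float("-inf")    # running max of scaled scores (numerical stability)
    s = 0.0              # running sum of exp(score - m)
    acc = [0.0] * d      # running exp-weighted sum of value vectors

    for k_t, v_t in zip(K, V):
        score = scale * sum(qi * ki for qi, ki in zip(q, k_t))
        m_new = max(m, score)
        alpha = math.exp(m - m_new)   # rescale old state to the new max
        w = math.exp(score - m_new)   # weight of the current token
        s = s * alpha + w
        acc = [a * alpha + w * vi for a, vi in zip(acc, v_t)]
        m = m_new

    # Final normalization uses only the scalars kept in registers.
    return [a / s for a in acc]
```

The key design point mirrored from the abstract is that the normalization constant is carried along incrementally, so the KV cache is streamed once with uniform per-token work, with no blockwise softmax merge step afterward.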