FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address redundant KV cache accesses and load imbalance during decoding in large language models with prefix sharing, this paper proposes FlashForge, a dedicated shared-prefix attention kernel. First, it introduces a prefix-tree-aware memory access pattern to enhance cache locality and hierarchical memory utilization. Second, it designs intra- and inter-block cooperative parallelism to maximize hardware throughput. Third, it employs a cost-estimation-driven dynamic load-balancing strategy to mitigate the irregular computation and memory access patterns induced by variable-length shared prefixes. Experiments show that FlashForge achieves a 1.9× average speedup over FlashDecoding and reduces KV cache memory accesses by 120.9×, while cutting end-to-end time per output token by 3.8× compared to vLLM. The work systematically tackles the irregular compute and memory-access challenges arising from prefix sharing in the decode stage.
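The core idea of combining the memory access of a shared prefix can be sketched in NumPy: attention over the shared prefix is computed once per query, then merged with attention over the per-request suffix using the standard partial-softmax (log-sum-exp) merge used by split-KV kernels such as FlashDecoding. This is a minimal illustrative sketch, not the paper's GPU kernel; the function names are assumptions.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one KV segment; also returns the log-sum-exp
    of the scores so segments can be merged exactly later."""
    scores = K @ q / np.sqrt(q.shape[0])            # (n,)
    m = scores.max()
    w = np.exp(scores - m)
    s = w.sum()
    return (w @ V) / s, m + np.log(s)               # output (d,), lse scalar

def merge(o1, lse1, o2, lse2):
    """Combine two partial results as if attention ran over both segments."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 8, 16, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n_prefix + n_suffix, d))
V = rng.standard_normal((n_prefix + n_suffix, d))

# Reference: attention over the full KV cache in one pass.
ref, _ = partial_attention(q, K, V)

# Prefix part is computed once (its KV reads are shareable across all
# requests with the same prefix); the suffix part is per-request.
o_p, l_p = partial_attention(q, K[:n_prefix], V[:n_prefix])
o_s, l_s = partial_attention(q, K[n_prefix:], V[n_prefix:])
out = merge(o_p, l_p, o_s, l_s)

assert np.allclose(out, ref)
```

Because the merge is exact, a batch of decode queries sharing a prefix can read that prefix's KV cache from memory once instead of once per request, which is where the reported memory-access reduction comes from.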

📝 Abstract
Prefix sharing among multiple prompts presents an opportunity to combine operations over the shared prefix. Meanwhile, attention computation in the decode stage, which becomes a critical bottleneck as context lengths grow, is a memory-intensive process requiring heavy accesses to the key-value (KV) cache of those prefixes. In this paper, we therefore explore the potential of prefix sharing in decode-stage attention computation. However, the tree structure of the prefix-sharing mechanism poses significant challenges: the attention kernel must efficiently process shared KV cache access patterns while managing complex dependencies and balancing irregular workloads. To address these challenges, we propose FlashForge, a dedicated attention kernel that combines the memory accesses of shared prefixes in the decode stage. FlashForge delivers two key innovations: a novel shared-prefix attention kernel that optimizes the memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload-balancing mechanism that efficiently estimates cost, divides tasks, and schedules execution. Experimental results show that FlashForge achieves an average 1.9x speedup and a 120.9x memory-access reduction over the state-of-the-art FlashDecoding kernel on decode-stage attention, and a 3.8x improvement in end-to-end time per output token compared to vLLM.
Problem

Research questions and friction points this paper is trying to address.

Optimizing memory access for shared prefixes in LLM decoding
Handling tree-structured prefix-sharing in attention computation
Balancing workloads for efficient KV cache processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefix-sharing optimizes memory access in decoding
Shared-prefix attention kernel enhances parallelism
Workload balancing mechanism improves task scheduling
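The workload-balancing idea from the bullets above can be sketched as cost estimation plus greedy scheduling: oversized KV segments are split into chunks (whose partial results can be merged later), each chunk's cost is estimated, and chunks are assigned largest-first to the least-loaded worker. This is a hypothetical illustration under a simple linear cost model; the paper's actual estimator, chunk sizes, and scheduler are not specified here.

```python
import heapq

def estimate_cost(kv_len, n_queries):
    # Hypothetical linear model: decode attention cost is dominated by
    # KV reads, amortized over queries batched on the same shared prefix.
    return kv_len * max(n_queries, 1)

def balance(tasks, n_workers, max_chunk=4096):
    """Split (name, kv_len, n_queries) tasks into chunks, then assign
    chunks largest-first to the least-loaded worker (LPT scheduling)."""
    pieces = []
    for name, kv_len, n_q in tasks:
        start = 0
        while start < kv_len:
            chunk = min(max_chunk, kv_len - start)
            pieces.append((estimate_cost(chunk, n_q), name, start, chunk))
            start += chunk
    pieces.sort(reverse=True)                        # largest cost first
    heap = [(0, w, []) for w in range(n_workers)]    # (load, id, assigned)
    heapq.heapify(heap)
    for cost, name, start, chunk in pieces:
        load, w, assigned = heapq.heappop(heap)      # least-loaded worker
        assigned.append((name, start, chunk))
        heapq.heappush(heap, (load + cost, w, assigned))
    return sorted(heap)                              # by final load

# One heavily shared prefix plus two small per-request suffixes.
plan = balance([("shared", 8192, 32), ("a", 1024, 1), ("b", 1024, 1)], 2)
loads = [load for load, _, _ in plan]
assert max(loads) == min(loads)                      # evenly balanced here
```

Splitting before scheduling is what keeps a single very long shared prefix from serializing on one streaming multiprocessor, at the price of an extra merge step over the partial results.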