FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address redundant KV cache accesses and load imbalance during decoding in large language models with prefix sharing, this paper proposes FlashForge, a dedicated shared-prefix attention kernel. First, it introduces a prefix-tree-aware memory access pattern to enhance cache locality and hierarchical memory utilization. Second, it designs intra- and inter-block cooperative parallelism to maximize hardware throughput. Third, it employs a cost-estimation-driven dynamic load-balancing strategy to mitigate the irregular computation and memory access patterns induced by variable-length shared prefixes. Experiments show that FlashForge achieves a 1.9× average speedup over FlashDecoding and reduces KV cache memory accesses by 120.9×, while cutting end-to-end time per output token by 3.8× compared to vLLM. The work systematically tackles the irregular compute and memory-access challenges arising from prefix sharing in the decode stage.
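The core idea of combining the memory access of a shared prefix can be sketched in NumPy: attention over the shared prefix is computed once per query, then merged with attention over the per-request suffix using the standard partial-softmax (log-sum-exp) merge used by split-KV kernels such as FlashDecoding. This is a minimal illustrative sketch, not the paper's GPU kernel; the function names are assumptions.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one KV segment; also returns the log-sum-exp
    of the scores so segments can be merged exactly later."""
    scores = K @ q / np.sqrt(q.shape[0])            # (n,)
    m = scores.max()
    w = np.exp(scores - m)
    s = w.sum()
    return (w @ V) / s, m + np.log(s)               # output (d,), lse scalar

def merge(o1, lse1, o2, lse2):
    """Combine two partial results as if attention ran over both segments."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 8, 16, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n_prefix + n_suffix, d))
V = rng.standard_normal((n_prefix + n_suffix, d))

# Reference: attention over the full KV cache in one pass.
ref, _ = partial_attention(q, K, V)

# Prefix part is computed once (its KV reads are shareable across all
# requests with the same prefix); the suffix part is per-request.
o_p, l_p = partial_attention(q, K[:n_prefix], V[:n_prefix])
o_s, l_s = partial_attention(q, K[n_prefix:], V[n_prefix:])
out = merge(o_p, l_p, o_s, l_s)

assert np.allclose(out, ref)
```

Because the merge is exact, a batch of decode queries sharing a prefix can read that prefix's KV cache from memory once instead of once per request, which is where the reported memory-access reduction comes from.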

📝 Abstract
Prefix sharing among multiple prompts presents an opportunity to combine operations over the shared prefix. Meanwhile, attention computation in the decode stage, which becomes a critical bottleneck as context lengths grow, is a memory-intensive process requiring heavy accesses to the key-value (KV) cache of those prefixes. In this paper, we therefore explore the potential of prefix sharing in decode-stage attention computation. However, the tree structure of the prefix-sharing mechanism poses significant challenges: the attention kernel must efficiently process shared KV cache access patterns while managing complex dependencies and balancing irregular workloads. To address these challenges, we propose FlashForge, a dedicated attention kernel that combines the memory accesses of shared prefixes in the decode stage. FlashForge delivers two key innovations: a novel shared-prefix attention kernel that optimizes the memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload-balancing mechanism that efficiently estimates cost, divides tasks, and schedules execution. Experimental results show that FlashForge achieves an average 1.9x speedup and a 120.9x memory-access reduction over the state-of-the-art FlashDecoding kernel on decode-stage attention, and a 3.8x improvement in end-to-end time per output token compared to vLLM.
Problem

Research questions and friction points this paper is trying to address.

Optimizing memory access for shared prefixes in LLM decoding
Handling tree-structured prefix-sharing in attention computation
Balancing workloads for efficient KV cache processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefix-sharing optimizes memory access in decoding
Shared-prefix attention kernel enhances parallelism
Workload balancing mechanism improves task scheduling
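The workload-balancing idea from the bullets above can be sketched as cost estimation plus greedy scheduling: oversized KV segments are split into chunks (whose partial results can be merged later), each chunk's cost is estimated, and chunks are assigned largest-first to the least-loaded worker. This is a hypothetical illustration under a simple linear cost model; the paper's actual estimator, chunk sizes, and scheduler are not specified here.

```python
import heapq

def estimate_cost(kv_len, n_queries):
    # Hypothetical linear model: decode attention cost is dominated by
    # KV reads, amortized over queries batched on the same shared prefix.
    return kv_len * max(n_queries, 1)

def balance(tasks, n_workers, max_chunk=4096):
    """Split (name, kv_len, n_queries) tasks into chunks, then assign
    chunks largest-first to the least-loaded worker (LPT scheduling)."""
    pieces = []
    for name, kv_len, n_q in tasks:
        start = 0
        while start < kv_len:
            chunk = min(max_chunk, kv_len - start)
            pieces.append((estimate_cost(chunk, n_q), name, start, chunk))
            start += chunk
    pieces.sort(reverse=True)                        # largest cost first
    heap = [(0, w, []) for w in range(n_workers)]    # (load, id, assigned)
    heapq.heapify(heap)
    for cost, name, start, chunk in pieces:
        load, w, assigned = heapq.heappop(heap)      # least-loaded worker
        assigned.append((name, start, chunk))
        heapq.heappush(heap, (load + cost, w, assigned))
    return sorted(heap)                              # by final load

# One heavily shared prefix plus two small per-request suffixes.
plan = balance([("shared", 8192, 32), ("a", 1024, 1), ("b", 1024, 1)], 2)
loads = [load for load, _, _ in plan]
assert max(loads) == min(loads)                      # evenly balanced here
```

Splitting before scheduling is what keeps a single very long shared prefix from serializing on one streaming multiprocessor, at the price of an extra merge step over the partial results.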