🤖 AI Summary
This work addresses the challenge of linear memory growth of the KV cache during long-context reasoning with large language models, which existing compression methods fail to resolve effectively because their reliance on local attention patterns often discards globally critical information. To overcome this limitation, the authors propose a structure-aware KV cache compression framework that identifies informational hubs via global in-degree centrality on the attention graph and dynamically selects optimal compression layers through an information-theoretic pivot selection mechanism. The approach further incorporates structural propagation and a decoupling strategy between storage and computation to preserve long-range dependencies. Extensive evaluation on the LongBench and RULER benchmarks demonstrates that the method significantly enhances the scalability of long-context reasoning while maintaining robust performance on global semantic understanding and retrieval tasks.
📝 Abstract
As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of the Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.
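To make the first innovation concrete: in-degree centrality on the attention graph can be read as the total attention mass each key position receives, aggregated over layers, heads, and query positions. The sketch below illustrates that idea under stated assumptions; the abstract does not specify StructKV's exact weighting or normalization, so the uniform summation, the tensor layout `(layers, heads, seq, seq)`, and the function name `global_in_degree_centrality` are illustrative choices, not the paper's implementation.

```python
import numpy as np

def global_in_degree_centrality(attn, budget):
    """Score tokens by the total attention they receive across all
    layers and heads (a sketch of in-degree centrality on the
    attention graph; the paper's exact aggregation is an assumption).

    attn:   array of shape (layers, heads, seq, seq), rows sum to 1.
    budget: number of KV entries to keep.
    Returns the sorted indices of the `budget` highest-scoring tokens.
    """
    # Sum over layers, heads, and query positions -> attention mass
    # flowing *into* each key position (its global in-degree).
    scores = attn.sum(axis=(0, 1, 2))        # shape: (seq,)
    keep = np.argsort(scores)[-budget:]      # top-`budget` hub tokens
    return np.sort(keep)

# Toy example: 2 layers, 2 heads, 6 tokens, random row-normalized attention.
rng = np.random.default_rng(0)
attn = rng.random((2, 2, 6, 6))
attn /= attn.sum(axis=-1, keepdims=True)     # rows sum to 1
kept = global_in_degree_centrality(attn, budget=3)
print(kept)  # indices of the 3 highest-centrality tokens
```

A token that looks dormant at one layer can still rank highly here if it receives heavy attention at other depths, which is exactly the failure mode of single-layer saliency snapshots that the abstract describes.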