CacheFormer: High Attention-Based Segment Caching

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between efficiency and accuracy in Transformer-based long-context modeling, this paper proposes a segment-level attention-triggered dynamic caching mechanism. Inspired by computer cache hierarchies and virtual memory management, the method partitions input sequences into compressible segments and dynamically schedules decompression and loading of neighboring segments only upon detecting high segment-level attention scores. Integrating sliding windows, overlapping segments, and top-k real-time decompression, it mitigates boundary fragmentation inherent in segment-based approaches. The resulting multi-granularity attention architecture—combining compressed segments, dynamic decompression, sliding windows, and overlapping segments—achieves an average 8.5% reduction in perplexity over baseline models of comparable size. It significantly outperforms state-of-the-art methods including Linformer, Longformer, Performer, and leading structured state space models (SSMs).
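The cache-miss analogy above can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's implementation: segments are compressed by mean-pooling, a query scores the compressed segments, and the top-k highest-scoring segments are fetched in uncompressed form together with their immediate neighbors, mirroring how a cache fill also brings in adjacent data. The function name, pooling choice, and neighbor radius are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topk_segment_retrieval(query, tokens, seg_len=4, k=2, fetch_neighbors=True):
    """Hypothetical sketch of attention-triggered segment caching:
    score compressed (mean-pooled) segments against the query, then
    fetch the top-k segments uncompressed, plus their neighbors
    (the cache-miss analogy: adjacent data is retrieved too)."""
    n_seg = len(tokens) // seg_len
    segs = tokens[: n_seg * seg_len].reshape(n_seg, seg_len, -1)
    compressed = segs.mean(axis=1)           # one vector per segment
    scores = softmax(compressed @ query)     # segment-level attention
    top = set(np.argsort(scores)[-k:])       # k highest-attention segments
    if fetch_neighbors:
        top = {i + d for i in top for d in (-1, 0, 1)}
    idx = sorted(i for i in top if 0 <= i < n_seg)
    return idx, scores
```

Only the selected segments would then participate in full-resolution attention; everything else stays in compressed form, which is what keeps the cost sub-quadratic.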

📝 Abstract
Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches such as Linformer, Longformer, Performer, and structured state space models (SSMs) have not fully resolved this problem. All of these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the quality loss caused by compressing the long context. Inspired by the cache and virtual memory principles of computer systems, where on a cache miss not only the needed data but also adjacent data is retrieved from memory, we apply this concept to long-context handling by dividing the context into small segments. In our design, when high segment-level attention occurs at the compressed level, we retrieve the nearby segments in uncompressed form. Our enhancements for handling long context aggregate four attention mechanisms: short sliding-window attention, long compressed segmented attention, dynamic retrieval of the top-k high-attention uncompressed segments, and overlapping segments in long segment attention to avoid segment fragmentation. These enhancements result in an architecture that outperforms existing SOTA architectures, with an average perplexity improvement of 8.5% over models of similar size.
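The overlapping-segment idea in the abstract can be sketched as follows. This is a minimal illustration under assumed parameters (the function name and defaults are not from the paper): segment starts advance by a stride smaller than the segment length, so every hard segment boundary falls inside the interior of some other segment and no token pair is permanently split across a boundary.

```python
def overlapping_segments(n_tokens, seg_len=8, overlap=4):
    """Hypothetical sketch of overlap to avoid segment fragmentation:
    segment starts advance by stride = seg_len - overlap, so interior
    positions are covered by at least two segments and no boundary
    cleanly splits the sequence."""
    stride = seg_len - overlap
    starts = range(0, max(n_tokens - seg_len, 0) + 1, stride)
    return [(s, min(s + seg_len, n_tokens)) for s in starts]
```

With `seg_len=8, overlap=4` on 32 tokens this yields segments (0, 8), (4, 12), ..., (24, 32): every non-overlapping boundary such as position 8 sits in the middle of the (4, 12) segment, which is the fragmentation the paper's overlapping long-segment attention is designed to avoid.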
Problem

Research questions and friction points this paper is trying to address.

Efficiently handling long contexts in transformers
Reducing quadratic time complexity of attention
Minimizing quality loss in context compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment caching with high attention retrieval
Combining four attention mechanisms effectively
Overlapping segments to prevent fragmentation
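The short-range component of the four aggregated attention mechanisms, sliding-window attention, can be shown as a causal mask. This is a generic sketch of the standard technique, not code from the paper: each position attends only to itself and the `window - 1` preceding positions, which is the local complement to the compressed and dynamically retrieved segment-level attentions.

```python
import numpy as np

def sliding_window_mask(n, window=4):
    """Causal sliding-window attention mask (generic sketch):
    True means position i may attend to position j, restricted to
    the most recent `window` positions including i itself."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)
```

In an aggregated design like the one described here, the outputs of this local attention would be combined with the segment-level attentions, so that short-range detail and compressed long-range context are both available to every token.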