🤖 AI Summary
This work addresses the limitations of large language models on long-context tasks, which stem from the high computational and memory cost of self-attention and from catastrophic forgetting. The authors propose AllMem, a novel architecture that integrates nonlinear test-time training (TTT) memory networks with sliding window attention (SWA), overcoming the representational bottleneck of linear memory models. This approach mitigates forgetting without incurring the prohibitive cost of full global attention, and a memory-efficient fine-tuning strategy enables pretrained models to be transferred to the new architecture efficiently. Experiments show that a 4k-window AllMem model scores only 0.83 points below full attention on the 37k-length LongBench benchmark, while an 8k-window variant even surpasses full attention on the 128k InfiniteBench, confirming its long-range modeling capability and efficiency.
📝 Abstract
Large Language Models (LLMs) face significant performance bottlenecks on long-sequence tasks due to the computational complexity and memory overhead of the self-attention mechanism. To address these challenges, we introduce \textsc{AllMem}, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with nonlinear Test-Time Training (TTT) memory networks. \textsc{AllMem} enables models to scale to ultra-long contexts while mitigating catastrophic forgetting: it overcomes the representational constraints of linear memory models and significantly reduces the computational and memory footprint of long-sequence inference. We further introduce a Memory-Efficient Fine-Tuning strategy that replaces standard attention layers in pre-trained models with memory-augmented sliding-window layers, allowing any off-the-shelf pre-trained LLM to be converted into an \textsc{AllMem}-based architecture. Empirically, our 4k-window model achieves near-lossless performance on the 37k-length LongBench, with a marginal 0.83-point drop relative to full attention. Moreover, on InfiniteBench at a 128k context, our 8k-window variant outperforms full attention, validating the effectiveness of parameterized memory in suppressing noise and maintaining robust long-range modeling without the prohibitive cost of global attention.
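The abstract describes a hybrid layer in which sliding window attention handles local context while a nonlinear memory network, updated by gradient descent at test time, carries information beyond the window. The sketch below illustrates that general idea in NumPy; it is not the authors' implementation, and every name (`sliding_window_attention`, `TTTMemory`, `allmem_layer`) and hyperparameter here is invented for illustration.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal attention where each query sees only the last `window` keys.
    q, k, v: arrays of shape (T, d)."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())   # numerically stable softmax
        w /= w.sum()
        out[t] = w @ v[lo:t + 1]
    return out

class TTTMemory:
    """Toy nonlinear memory: a two-layer MLP f(key) ~ value, updated online
    by SGD on the squared reconstruction error (the TTT-style 'write')."""
    def __init__(self, d, hidden=32, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (d, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, d))
        self.lr = lr

    def read(self, key):
        return np.tanh(key @ self.W1) @ self.W2

    def write(self, key, value):
        # One gradient step on ||f(key) - value||^2: the nonlinear
        # analogue of a linear fast-weight / outer-product update.
        h = np.tanh(key @ self.W1)
        err = h @ self.W2 - value
        gW2 = np.outer(h, err)
        gh = (err @ self.W2.T) * (1.0 - h ** 2)   # backprop through tanh
        gW1 = np.outer(key, gh)
        self.W1 -= self.lr * gW1
        self.W2 -= self.lr * gW2

def allmem_layer(q, k, v, memory, window):
    """Hypothetical hybrid layer: local SWA output plus a global memory read,
    with each token's (k, v) pair written into the memory as it streams by."""
    local = sliding_window_attention(q, k, v, window)
    out = np.zeros_like(local)
    for t in range(q.shape[0]):
        out[t] = local[t] + memory.read(q[t])  # local + retrieved context
        memory.write(k[t], v[t])               # store token for later queries
    return out
```

Because the memory is a parameterized network rather than a linear associative map, repeated writes refine a nonlinear key-to-value function instead of merely accumulating outer products, which is the representational advantage the abstract attributes to TTT memory over linear memory models.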