AllMem: A Memory-centric Recipe for Efficient Long-context Modeling

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models in long-context tasks, which stem from the high computational and memory costs of self-attention and catastrophic forgetting. The authors propose AllMem, a novel architecture that uniquely integrates nonlinear test-time training (TTT) memory networks with sliding window attention (SWA), thereby overcoming the representational bottlenecks of linear memory models. This approach effectively mitigates forgetting without incurring the prohibitive cost of full global attention. Coupled with a memory-efficient fine-tuning strategy, AllMem enables efficient transfer of pretrained models. Experiments demonstrate that a 4k-window AllMem model scores only 0.83 points lower than full attention on the 37k-length LongBench benchmark, while an 8k-window variant even surpasses full attention on the 128k InfiniteBench, confirming its superior long-range modeling capability and efficiency.

📝 Abstract
Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. AllMem enables models to scale effectively to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representational constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy that replaces standard attention layers in pre-trained models with memory-augmented sliding-window layers, allowing any off-the-shelf pre-trained LLM to be converted efficiently into an AllMem-based architecture. Empirical evaluations confirm that our 4k-window model achieves near-lossless performance on 37k-length LongBench, with a marginal 0.83-point drop relative to full attention. Moreover, on InfiniteBench at a 128k context, our 8k-window variant outperforms full attention, validating the effectiveness of our parameterized memory in suppressing noise and maintaining robust long-range modeling without the prohibitive cost of global attention.
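The abstract names two ingredients: sliding window attention for local context, and a nonlinear TTT memory (a small network trained by gradient steps at inference time) for long-range recall. The sketch below illustrates both in miniature. It is a hypothetical simplification, not the authors' implementation: the `TTTMemory` class (a two-layer MLP updated with one SGD step per token) and the `allmem_block` combination rule are illustrative choices, and all names here are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, window):
    """Causal attention where token t attends only to the last `window` keys."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ v[lo:t + 1]
    return out

class TTTMemory:
    """Toy nonlinear test-time-training memory: a 2-layer MLP f(key) ~ value,
    updated by one SGD step per token (a deliberate simplification)."""
    def __init__(self, d, hidden=32, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (d, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, d))
        self.lr = lr

    def read(self, q):
        # Recall: run the query through the memory network.
        return np.tanh(q @ self.W1) @ self.W2

    def write(self, k, v):
        # One gradient step on ||f(k) - v||^2: the "training" in test-time training.
        h = np.tanh(k @ self.W1)
        err = h @ self.W2 - v
        gW2 = np.outer(h, err)
        gW1 = np.outer(k, (err @ self.W2.T) * (1.0 - h ** 2))
        self.W1 -= self.lr * gW1
        self.W2 -= self.lr * gW2

def allmem_block(q, k, v, window, mem):
    """Hypothetical combination: local SWA output plus a memory read,
    with the memory updated online on each (key, value) pair."""
    local = sliding_window_attention(q, k, v, window)
    out = np.zeros_like(local)
    for t in range(len(q)):
        out[t] = local[t] + mem.read(q[t])  # local detail + long-range recall
        mem.write(k[t], v[t])               # test-time update
    return out
```

Because the memory is a nonlinear network rather than a fixed-size linear state, repeated writes can fit key-value associations that a linear recurrence cannot, which is the representational gap the paper argues SWA-plus-TTT closes at far lower cost than global attention.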
Problem

Research questions and friction points this paper is trying to address.

long-context modeling
self-attention
memory overhead
computational complexity
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding Window Attention
Test-Time Training
Memory-Augmented Architecture
Long-Context Modeling
Memory-Efficient Fine-Tuning
Ziming Wang
ACS Lab, Huawei Technologies
Xiang Wang
ACS Lab, Huawei Technologies
Kailong Peng
ACS Lab, Huawei Technologies
Lang Qin
ACS Lab, Huawei Technologies
Juan Gabriel Kostelec
Huawei Switzerland
Christos Sourmpis
Huawei Switzerland
Axel Laborieux
Research Scientist, Huawei Technologies
Neuromorphic computing, Computational Neuroscience, Learning algorithms
Qinghai Guo
ACS Lab, Huawei Technologies