🤖 AI Summary
Transformers exhibit inherent limitations in modeling long-range context, continual learning, and knowledge integration. This review addresses these challenges through a neuroscience-inspired framework for Memory-Augmented Transformers that unifies multi-timescale memory, selective attention, and synaptic consolidation mechanisms, tracing a shift from static caching toward adaptive, test-time learning. Recent work is organized along three dimensions: integration mechanisms that combine attention fusion, gated control, and associative retrieval; hybrid memory representations spanning parameter-encoded, state-based internal, and explicit external memory; and functional objectives such as context extension, reasoning, knowledge integration, and adaptation. Hierarchical buffering structures and surprise-driven memory updates are highlighted as emerging remedies for capacity bottlenecks and catastrophic forgetting. Across the surveyed literature, these mechanisms yield substantial gains in long-sequence modeling stability and cross-task knowledge transfer, pointing to a scalable, biologically plausible path toward models capable of lifelong learning.
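
To make the surprise-driven write path concrete, here is a minimal PyTorch sketch, not the formulation of any single surveyed paper: the class name `SurpriseGatedMemory`, the MSE-based surprise signal, the threshold value, and the least-used-slot eviction rule are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SurpriseGatedMemory(nn.Module):
    """Toy slot-based external memory: a slot is overwritten only when the
    incoming representation is 'surprising', i.e. poorly reconstructed by
    an associative read over the current memory contents."""

    def __init__(self, num_slots: int, dim: int, surprise_threshold: float = 0.5):
        super().__init__()
        self.register_buffer("memory", torch.zeros(num_slots, dim))
        self.register_buffer("usage", torch.zeros(num_slots))
        self.surprise_threshold = surprise_threshold

    def read(self, query: torch.Tensor) -> torch.Tensor:
        # Content-based (associative) read: scaled dot-product attention over slots.
        scores = query @ self.memory.t() / self.memory.size(-1) ** 0.5   # (B, S)
        weights = F.softmax(scores, dim=-1)
        return weights @ self.memory                                      # (B, D)

    def write(self, item: torch.Tensor) -> None:
        # Surprise = per-item reconstruction error of the item from its own read.
        recalled = self.read(item)
        surprise = F.mse_loss(recalled, item, reduction="none").mean(-1)  # (B,)
        for i in torch.nonzero(surprise > self.surprise_threshold).flatten():
            slot = torch.argmin(self.usage)           # overwrite the least-used slot
            self.memory[slot] = item[i].detach()
            self.usage[slot] = self.usage.max() + 1   # mark slot as most recently written
```

Gating writes on reconstruction error means routine, well-predicted inputs leave the memory untouched, which is one simple way to limit interference between old and new content.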
📝 Abstract
Memory is fundamental to intelligence, enabling learning, reasoning, and adaptability across biological and artificial systems. While Transformer architectures excel at sequence modeling, they face critical limitations in long-range context retention, continual learning, and knowledge integration. This review presents a unified framework bridging neuroscience principles, including dynamic multi-timescale memory, selective attention, and consolidation, with engineering advances in Memory-Augmented Transformers. We organize recent progress through three taxonomic dimensions: functional objectives (context extension, reasoning, knowledge integration, adaptation), memory representations (parameter-encoded, state-based, explicit, hybrid), and integration mechanisms (attention fusion, gated control, associative retrieval). Our analysis of core memory operations (reading, writing, forgetting, and capacity management) reveals a shift from static caches toward adaptive, test-time learning systems. We identify persistent challenges in scalability and interference, alongside emerging solutions including hierarchical buffering and surprise-gated updates. This synthesis provides a roadmap toward cognitively inspired, lifelong-learning Transformer architectures.
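
As a concrete illustration of the "gated control" integration mechanism, the sketch below shows one way a learned sigmoid gate could blend vectors retrieved from memory into a Transformer's hidden states. The class name `GatedMemoryFusion` and the assumption that retrieved vectors share the hidden width are illustrative choices, not a specific surveyed architecture.

```python
import torch
import torch.nn as nn


class GatedMemoryFusion(nn.Module):
    """Illustrative gated fusion: a per-token, per-channel sigmoid gate decides
    how much retrieved memory to mix into the self-attention output."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # hidden, retrieved: (batch, seq_len, dim)
        g = torch.sigmoid(self.gate(torch.cat([hidden, retrieved], dim=-1)))
        return g * retrieved + (1.0 - g) * hidden


# Usage sketch with hypothetical shapes: blend a memory read into a layer's output.
layer = GatedMemoryFusion(dim=64)
hidden = torch.randn(2, 16, 64)      # Transformer hidden states
retrieved = torch.randn(2, 16, 64)   # per-token vectors read from external memory
fused = layer(hidden, retrieved)     # (2, 16, 64)
```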