🤖 AI Summary
To address information decay and semantic fragmentation in large language models (LLMs) during long-context understanding, this work draws inspiration from biological working memory and cortical modularity. It proposes PaceLLM, a method with two components: (1) a Persistent Activity (PA) mechanism that emulates sustained prefrontal neuronal firing, using an activation-level memory bank to dynamically retrieve, reuse, and update critical feed-forward network (FFN) hidden states; and (2) Cortical Expert (CE) Clustering, which performs semantics-driven, task-adaptive clustering of FFN weights into semantic modules to establish cross-token dependencies and mitigate semantic fragmentation. The method can be applied to existing LLMs without structural overhauls. Experiments show consistent improvements: +6% on LongBench multi-document QA, +12.5–17.5% on Infinite-Bench tasks, and a measurable context length extended to 200K tokens in needle-in-a-haystack (NIAH) tests. The approach strengthens long-range context modeling while also improving interpretability.
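The summary does not say how the PA memory bank retrieves, reuses, or updates states. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. It assumes the following:

- retrieval uses cosine similarity against stored FFN activation vectors;
- "reuse" is a linear blend of the current activation with the retrieved one, applied only when similarity passes a threshold;
- "update" writes the blended state back into the matched slot;
- unmatched activations are stored as new slots, and the oldest-inserted slot is evicted when the bank is full.

The names `ActivationMemoryBank`, `threshold`, `alpha`, and `capacity` are hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ActivationMemoryBank:
    """Hypothetical activation-level memory bank (illustrative sketch only).

    ASSUMPTIONS: cosine retrieval, threshold gating, linear blending (alpha),
    and FIFO eviction are illustrative choices, not the paper's stated rules.
    """

    def __init__(self, capacity=4, threshold=0.9, alpha=0.5):
        self.slots = deque(maxlen=capacity)  # oldest-inserted slot is evicted when full
        self.threshold = threshold
        self.alpha = alpha

    def step(self, h):
        """Retrieve, reuse, and update memory for one FFN activation vector h."""
        best_i, best_sim = None, -1.0
        for i, m in enumerate(self.slots):
            s = cosine(h, m)
            if s > best_sim:
                best_i, best_sim = i, s
        if best_i is not None and best_sim >= self.threshold:
            m = self.slots[best_i]
            # Reuse: blend the retrieved state into the current activation.
            fused = [(1 - self.alpha) * x + self.alpha * y for x, y in zip(h, m)]
            # Update: refresh the matched slot with the blended state.
            self.slots[best_i] = fused
            return fused
        # No sufficiently similar memory: store h as a new slot.
        self.slots.append(list(h))
        return list(h)
```

In this sketch, an activation similar to a stored one (for example `[0.9, 0.1]` after `[1.0, 0.0]` was stored) is pulled toward the stored state. Dissimilar activations simply occupy new slots.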
📝 Abstract
While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by two factors: transient neural activations, which cause information decay, and unstructured feed-forward network (FFN) weights, which lead to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations. The first is a Persistent Activity (PA) Mechanism that mimics the persistent firing of prefrontal cortex (PFC) neurons. It introduces an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay. The second is Cortical Expert (CE) Clustering, which emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves a 6% improvement on LongBench's Multi-document QA and 12.5–17.5% performance gains on Infinite-Bench tasks. It also extends the measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other long-context approaches. Moreover, it can be applied to any model to enhance its long-context performance and interpretability without structural overhauls.
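The abstract does not specify how CE Clustering groups FFN weights into semantic modules. A minimal sketch under one assumed reading:

- cluster the FFN's hidden neurons by their input-weight vectors with plain k-means;
- permute the neurons so that each cluster forms a contiguous "expert" block.

The paper describes its grouping as semantics-driven and task-adaptive, which this sketch does not model. It also assumes a ReLU FFN, `y = W_out · relu(W_in · x)`, and deterministic k-means initialization. The functions `kmeans`, `cluster_ffn_neurons`, and `ffn` are hypothetical.

```python
def kmeans(vectors, k, iters=10):
    """Plain Lloyd's k-means; deterministic init from the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_ffn_neurons(w_in, w_out, k):
    """Reorder FFN hidden neurons into k contiguous blocks (hypothetical helper).

    w_in:  hidden x d_model, one row per hidden neuron
    w_out: d_model x hidden, column j belongs to hidden neuron j
    """
    labels = kmeans(w_in, k)
    order = sorted(range(len(w_in)), key=lambda j: (labels[j], j))
    new_w_in = [w_in[j] for j in order]
    new_w_out = [[row[j] for j in order] for row in w_out]
    sizes = [labels.count(c) for c in range(k)]
    return new_w_in, new_w_out, sizes


def ffn(x, w_in, w_out):
    """y = W_out @ relu(W_in @ x), written with plain lists (ReLU is an assumption)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * hj for w, hj in zip(row, h)) for row in w_out]
```

The activation acts elementwise, so permuting the rows of `W_in` together with the matching columns of `W_out` leaves the FFN output unchanged. In this sketch the reordering changes only how neurons are organized into modules, not what the layer computes.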