🤖 AI Summary
This work addresses the problem that memories written by upstream large language models (LLMs) are often poorly utilized by downstream LLMs when users switch between models. To this end, the authors propose a memory-centric LLM adaptation framework that jointly trains two profile-conditioned operators, one for writing memory and one for reading it, to optimize how memory contents are stored and presented. The framework further incorporates a minimum-gain sampling curriculum, which prioritizes the LLMs that currently benefit least during training, and a performance-gap reward, which measures improvement over a naive memory baseline, so that the operators generalize across models. Experiments on HotpotQA, 2WikiMultihopQA, and MuSiQue show that the method consistently outperforms baselines and remains robust when the downstream LLM is replaced by one not seen during training.
📝 Abstract
Memory is the key component for transforming a stateless LLM into a persistent, evolving agent, enabling experience accumulation, long-horizon planning, and continual self-improvement. Existing memory systems are typically LLM-centric: they design memory operations tailored to a specific backbone. In practice, however, users frequently switch between LLMs, for example using Claude for coding and GPT for writing, or routing different steps of a single task to different backbones to make cost-effective trade-offs. As a result, memory written by one model often needs to be consumed by another. How to make upstream memory effectively adapt to and activate downstream LLMs remains a critical yet underexplored problem. To bridge this gap, we shift the perspective from LLM-centric memory design to memory-centric LLM adaptation. Specifically, we address this upstream-downstream memory adaptation problem from both the write and read sides, designing two profile-conditioned operators that are jointly trained to optimize how memory is stored and presented for better task completion. To ensure that the learned operators generalize across a broad set of LLMs, we propose a minimum-gain sampling curriculum that prioritizes the least-served LLMs during training. To measure the operators' actual contribution rather than the downstream LLM's own capability, we design a performance-gap reward that compares task performance against a naive memory baseline. Experiments on HotpotQA, 2WikiMultihopQA, and MuSiQue demonstrate that our model consistently outperforms baselines and remains robust when the downstream LLM is replaced by an unseen model.
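The abstract does not say how the performance-gap reward or the minimum-gain curriculum are computed. The following is a minimal Python sketch under our own assumptions, not the authors' implementation. It assumes the reward is a task score (e.g., exact match or F1) obtained with the trained operators minus the score the same LLM obtains with naive memory. It also assumes each LLM's "gain" is tracked as an exponential moving average of that reward, and that LLMs are sampled with a softmax over negative gains so the least-served model is picked most often. The names `performance_gap_reward` and `MinGainCurriculum`, along with the `temperature` and `momentum` parameters, are hypothetical.

```python
import math
import random
from dataclasses import dataclass, field


def performance_gap_reward(score_with_operators: float, score_with_naive_memory: float) -> float:
    """Hypothetical reward: how much the operators improve over naive memory for the same LLM."""
    return score_with_operators - score_with_naive_memory


@dataclass
class MinGainCurriculum:
    """Hypothetical sampler that favours the LLMs currently gaining least from the operators."""
    llm_names: list[str]
    temperature: float = 0.1  # assumption: lower values approach always picking the minimum-gain LLM
    momentum: float = 0.9     # assumption: EMA factor for each LLM's running gain
    gains: dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every LLM starts at zero estimated gain, so all are equally likely at first.
        for name in self.llm_names:
            self.gains.setdefault(name, 0.0)

    def weights(self) -> list[float]:
        # Softmax over negative gains: a smaller gain yields a larger sampling weight.
        logits = [-self.gains[n] / self.temperature for n in self.llm_names]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, rng: random.Random) -> str:
        # Draw one LLM to train against in the next step.
        return rng.choices(self.llm_names, weights=self.weights(), k=1)[0]

    def update(self, name: str, reward: float) -> None:
        # Exponential moving average of the performance-gap reward for this LLM.
        self.gains[name] = self.momentum * self.gains[name] + (1 - self.momentum) * reward


if __name__ == "__main__":
    rng = random.Random(0)
    curriculum = MinGainCurriculum(["llm_a", "llm_b", "llm_c"])
    # llm_a already benefits from the operators, so it becomes less likely to be sampled.
    curriculum.update("llm_a", performance_gap_reward(0.62, 0.50))
    print(curriculum.weights())
    print(curriculum.sample(rng))
```

Using a softmax instead of a strict argmin is a design choice: it still concentrates training on the least-served LLMs, but keeps every model in rotation so that no model's gain estimate goes stale.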