🤖 AI Summary
This work addresses memory management for long-horizon large language model agents under a fixed memory budget, where the agent must decide what to encode, what to forget, and what to retrieve. The authors propose a cognitively inspired, multi-factor memory valuation model: a linear value function over seven interpretable factors (among them emotional intensity and goal relevance), with factor weights learned by a gradient-free optimizer so that a single score governs all three memory policies. Evaluated in a blind setting with no knowledge of future queries, the learned model outperforms single-factor, uniformly weighted, and recency-based baselines on LongMemEval, retaining 77.0% of gold evidence across 479 usable cases versus 65.7% for uniform weighting, the strongest baseline. A controlled synthetic task with planted confounds shows the same pattern, and the learned weights remain interpretable.
📝 Abstract
Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency -- both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function V(m)=\sum_i w_i f_i(m) over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at \approx 0.98 -- this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains 0.770 \pm 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable -- reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
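As a rough illustration of the value function V(m)=\sum_i w_i f_i(m) and its use in the forgetting decision, the sketch below scores a few memories and keeps the top ones under a budget. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Memory`, `make_memory`, `memory_value`, and `consolidate` are hypothetical, factor scores are assumed to lie in [0, 1], and the weights are invented for the example rather than taken from the learned weights reported in the paper.

```python
from dataclasses import dataclass

# The seven factors named in the abstract; identifiers are illustrative.
FACTORS = (
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
)

@dataclass
class Memory:
    text: str
    factors: dict  # factor name -> score, assumed to lie in [0, 1]

def make_memory(text, **scores):
    """Build a memory whose unspecified factor scores default to 0.0."""
    factors = dict.fromkeys(FACTORS, 0.0)
    factors.update(scores)
    return Memory(text, factors)

def memory_value(memory, weights):
    """Linear value V(m) = sum_i w_i * f_i(m)."""
    return sum(weights[name] * memory.factors[name] for name in FACTORS)

def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(memories, key=lambda m: memory_value(m, weights), reverse=True)
    return ranked[:budget]

# Made-up weights for illustration only (not the paper's learned values).
weights = dict.fromkeys(FACTORS, 0.05)
weights.update(reliability=0.4, emotional_intensity=0.3, self_user_relevance=0.2)

allergy = make_memory("User is allergic to peanuts",
                      reliability=0.9, emotional_intensity=0.5,
                      self_user_relevance=0.8)
small_talk = make_memory("Chatted about the weather", usage_history=0.2)
task_note = make_memory("Draft report is due Friday",
                        goal_relevance=0.9, task_utility=0.4)

kept = consolidate([small_talk, task_note, allergy], weights, budget=2)
kept_texts = [m.text for m in kept]
```

Because the same scalar ranks memories at consolidation time, no future query is needed to decide what to drop, which is the blind regime the abstract describes. In this sketch only the forgetting step is shown; encoding depth and retrieval rank would reuse `memory_value` in the same way.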