MEMENTO: Teaching LLMs to Manage Their Own Context

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses excessive context growth and computational overhead during large language model (LLM) inference caused by long, unstructured chains of thought. The authors propose MEMENTO, a method that teaches LLMs to manage their own reasoning context: reasoning is segmented into blocks, each block is compressed into a concise "memento" (a dense state summary), and subsequent reasoning proceeds by attending only to these mementos. Training combines a two-stage supervised fine-tuning recipe with reinforcement learning, supported by a vLLM extension for the inference method and by OpenMementos, a newly released open-source dataset of 228K summary-augmented reasoning trajectories derived from OpenThoughts-v3. Evaluated on 8B–32B models from the Qwen3, Phi-4, and Olmo 3 families, MEMENTO maintains strong accuracy on mathematical, scientific, and programming benchmarks while reducing peak KV cache usage by roughly 2.5× and improving inference throughput by about 1.75×. The authors also identify a dual information stream: each block's information is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block; removing the latter channel drops accuracy on AIME24 by 15 pp.

📝 Abstract
Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B–32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ~2.5× peak KV cache reduction. We extend vLLM to support our inference method, achieving ~1.75× throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15 pp on AIME24.
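The block-then-memento loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `summarize` are hypothetical stand-ins for the trained model's block generation and compression steps, and the three-block loop is an arbitrary choice. The point it shows is that the context passed to each new block contains only the question plus prior mementos, never the full prior blocks.

```python
def generate(context: str) -> str:
    """Stand-in for the model producing one reasoning block (hypothetical)."""
    return f"[reasoning block conditioned on {len(context)} chars of context]"

def summarize(block: str) -> str:
    """Stand-in for compressing a block into a dense memento (hypothetical)."""
    return block[:24] + "..."

def memento_reasoning(question: str, num_blocks: int = 3) -> list[str]:
    """Reason in blocks, keeping only mementos in the rolling context.

    After each block, the full block text is discarded and replaced by its
    short memento, so context (and with it KV cache) stays bounded instead
    of growing with the full chain of thought.
    """
    mementos: list[str] = []
    for _ in range(num_blocks):
        # Context = question + mementos only, never full prior blocks.
        context = question + "\n" + "\n".join(mementos)
        block = generate(context)
        mementos.append(summarize(block))
    return mementos

trace = memento_reasoning("What is 2 + 2?")
print(len(trace))  # one memento per reasoning block
```

In the real system the summarization is performed by the model itself (trained via the two-stage SFT recipe on OpenMementos), and the dropped block text still influences later steps through the retained KV states, which this text-only sketch does not capture.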
Problem

Research questions and friction points this paper is trying to address.

reasoning compression
context management
KV cache reduction
intermediate state summarization
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MEMENTO
reasoning compression
KV cache reduction
dense state summary
dual information stream