Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-horizon, multi-turn tool-augmented tasks, reinforcement learning (RL) fine-tuning of LLM-based agents is hindered by fixed context-length constraints, leading to degraded instruction following, high rollout computational cost, and poor training scalability. To address this, we propose SUPO, a novel RL algorithm that for the first time deeply integrates a learned summarization mechanism into the RL framework, enabling end-to-end joint optimization of the policy and a history summarizer. During both training and inference, SUPO compresses the dialogue history via dynamically generated LLM summaries, effectively bypassing the fixed context-window limitation. The summarizer is updated jointly with the policy using policy gradients. Experiments demonstrate that SUPO significantly improves success rates on function-calling and search tasks while maintaining lower or comparable context overhead. Moreover, increasing the number of summarization rounds at test time enhances generalization to more complex tasks.

📝 Abstract
We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. Specifically, it periodically compresses the tool-use history via LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization-augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function-calling and search tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also show that, for complex search tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used during training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context-length limit.
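The context-management loop described in the abstract can be sketched as follows. This is a minimal illustration with toy stand-ins (`num_tokens`, `summarize`, and `manage_context` are hypothetical names, not the paper's actual API): when the working context exceeds a token budget, the raw tool-use history is replaced by a compact summary that, in SUPO, would be generated by the LLM itself and trained end-to-end.

```python
def num_tokens(context):
    # Toy token count: whitespace-split words across all context entries.
    return sum(len(str(item).split()) for item in context)

def summarize(context):
    # Toy summarizer stand-in: in SUPO this is an LLM-generated summary
    # that retains task-relevant information and is itself optimized by RL.
    return f"[summary of {len(context)} context items]"

def manage_context(task, history, budget=50):
    """Return a working context that never exceeds `budget` toy tokens."""
    context = [task] + history
    if num_tokens(context) > budget:
        # Compress: keep the task, replace the raw history with a summary.
        context = [task, summarize(context)]
    return context

# Usage: a long tool-use history gets compressed back under the budget,
# so the episode can continue beyond the nominal context window.
task = "find the cheapest flight"
history = [f"tool call {i}: result with several words of output" for i in range(10)]
ctx = manage_context(task, history, budget=50)
```

Because the summary replaces the raw history rather than being appended to it, the working context stays bounded no matter how many tool-use rounds the episode runs.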
Problem

Research questions and friction points this paper is trying to address.

Addressing context length bottleneck in multi-turn LLM reinforcement learning
Optimizing tool-use behaviors and summarization strategies end-to-end
Enabling long-horizon training beyond fixed context window limits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM-generated summaries for context compression
End-to-end optimization of tool-use and summarization strategies
Enables long-horizon training beyond fixed context limits
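The end-to-end idea behind the bullets above can be illustrated with a REINFORCE-style surrogate: tokens emitted during summarization are treated as actions alongside tool-call tokens, so one policy gradient covers both behaviors. The function and log-probability values below are toy placeholders for illustration, not the paper's implementation.

```python
def policy_gradient_loss(log_probs_tool, log_probs_summary, reward, baseline=0.0):
    """REINFORCE-style surrogate loss: -(R - b) * sum of log-probs of all
    generated tokens, whether from tool-use turns or summarization turns."""
    advantage = reward - baseline
    total_log_prob = sum(log_probs_tool) + sum(log_probs_summary)
    return -advantage * total_log_prob

# Usage with toy values: a successful episode (reward 1.0, baseline 0.5)
# increases the likelihood of both tool-call and summary tokens together.
loss = policy_gradient_loss([-0.5, -0.2], [-0.3], reward=1.0, baseline=0.5)
```

Including the summary tokens in the same objective is what makes the summarization strategy itself trainable, rather than a fixed heuristic bolted onto the rollout.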