🤖 AI Summary
This work addresses credit assignment in multi-turn agent training, where conventional reinforcement learning struggles to distribute credit across intermediate turns and existing self-distillation approaches rely on privileged feedback that is misaligned with the student's current decision context. The authors propose HERO, a hindsight-enhanced self-distillation framework that uses the next environment observation after each action as locally aligned feedback. After each rollout, the framework reflects on the completed interaction and converts each observation into a compact turn-level diagnosis of the original action, covering points such as its necessity, validity, or failure cause, which gives the student dense and targeted guidance. On TauBench and WebShop, the method improves task success and reduces unnecessary turns relative to environment-feedback-only self-distillation and GRPO, and it is especially effective under limited training turn budgets, where successful rollouts are rare.
📝 Abstract
Reinforcement learning typically improves multi-turn agent capabilities using only the terminal outcome of each trajectory, which makes it difficult to assign credit to each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.
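The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` record, the `diagnose` callback (a stand-in for the hindsight reflection step), the `[hindsight]` prompt tag, and the choice of KL(teacher || student) as the per-token distillation loss are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    context: str           # what the student saw before acting
    action: str            # the action the student took
    next_observation: str  # the environment's response to that action


# Hypothetical signature: (completed rollout, turn index) -> diagnosis text.
Diagnose = Callable[[Sequence[Turn], int], str]


def build_teacher_contexts(rollout: Sequence[Turn], diagnose: Diagnose) -> List[str]:
    """After the rollout ends, append a turn-level diagnosis to each turn's context.

    The student is scored on turn.context alone; the self-teacher is the same
    model conditioned on the augmented context returned here.
    """
    return [
        f"{turn.context}\n[hindsight] {diagnose(rollout, i)}"
        for i, turn in enumerate(rollout)
    ]


def token_kl(teacher_probs: Sequence[float], student_probs: Sequence[float]) -> float:
    """KL(teacher || student) at one token position.

    Assumes both inputs are probability vectors over the same vocabulary and
    that every student probability is strictly positive.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def toy_diagnose(rollout: Sequence[Turn], i: int) -> str:
    """Toy stand-in for reflection: flags a turn whose observation reports an error."""
    observation = rollout[i].next_observation
    if "error" in observation.lower():
        return f"Action failed: {observation}"
    return "Action was valid."


rollout = [
    Turn("User wants order status.", "lookup_order(id=7)", "Error: unknown order id"),
    Turn("Lookup failed.", "ask_user_for_order_id()", "User: it is 17"),
]
teacher_contexts = build_teacher_contexts(rollout, toy_diagnose)
```

In this sketch the diagnosis for a turn is derived from that turn's own next observation, so the extra information given to the teacher stays tied to the decision the student actually faced, which is the alignment property the abstract emphasizes.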