Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets the inefficiency of tool-integrated reasoning caused by sparse, delayed rewards and weak step-level credit assignment, particularly when early irrecoverable errors dominate task outcomes in long-horizon trajectories. The authors propose a localized policy optimization framework built around such errors: it localizes the first irrecoverable step via a binary-search rollout tree to generate fine-grained learning signals, and combines hierarchical advantage attribution with error-localized adaptive clipping for precise credit assignment and targeted correction of the critical error and its downstream actions. Evaluated on benchmarks spanning mathematical reasoning, scientific question answering, and code execution, the method significantly outperforms strong baselines, improving Pass@K and Major@K scaling, rollout ranking quality, and tool-invocation efficiency.

📝 Abstract
Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
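The abstract's core mechanism, locating the first irrecoverable step under a fixed rollout budget, can be sketched as a binary search over trajectory prefixes. This is a minimal illustration, not the paper's implementation: it assumes recoverability is monotone along the trajectory (once a rollout becomes irrecoverable it stays so) and that a `recoverable(p)` oracle, standing in for a batch of resumed rollouts, is available.

```python
def locate_first_irrecoverable(num_steps, recoverable):
    """Binary search over prefix lengths for the first irrecoverable step.

    `recoverable(p)` is a hypothetical oracle: True if rollouts resumed
    after the first `p` steps can still succeed (in practice this would
    be estimated from a batch of rollouts, costing O(log n) batches).
    Assumes monotonicity: once irrecoverable, always irrecoverable.
    Returns the 1-based index of the first irrecoverable step,
    or None if the full trajectory remains recoverable.
    """
    if recoverable(num_steps):
        return None  # no irrecoverable step anywhere in the trajectory
    lo, hi = 0, num_steps  # invariant: prefix lo recoverable, prefix hi not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if recoverable(mid):
            lo = mid  # error lies strictly after step `mid`
        else:
            hi = mid  # error lies at or before step `mid`
    return hi  # step `hi` flipped the trajectory to irrecoverable
```

With a 10-step trajectory where step 4 is the fatal action (so any prefix of length 4 or more is doomed), the search pinpoints step 4 in about log2(10) oracle calls rather than 10, which is the point of the fixed-budget binary-search rollout tree.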
Problem

Research questions and friction points this paper is trying to address.

tool-integrated reasoning
credit assignment
irrecoverable error
reinforcement learning
long-horizon reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Error-Localized Policy Optimization
Tool-Integrated Reasoning
Hierarchical Advantage Attribution
Irrecoverable Error Localization
Adaptive Clipping