🤖 AI Summary
This work addresses the challenge of credit assignment in multi-turn large language model agent training, where existing group-based policy optimization methods often mistake noise for meaningful signal when task difficulty varies, leading to poor sample efficiency. To overcome this, the authors propose ProxMO, a framework that combines a task-difficulty-aware success-rate prior with a semantic proximity-based soft aggregation mechanism: the former dynamically modulates gradient magnitudes, while the latter constructs more accurate baselines from the global context, allowing the method to reliably identify genuine performance improvements. By design, ProxMO sidesteps the statistical biases inherent in conventional discrete batching, and it offers plug-and-play compatibility with minimal computational overhead, integrating seamlessly into existing training pipelines. Experiments on the ALFWorld and WebShop benchmarks demonstrate substantial performance gains over current approaches, with ablation studies confirming both the individual effectiveness and the synergy of its core components.
📝 Abstract
Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure on a trivial task may reflect random instability, whereas success on a high-difficulty task signifies a genuine capability breakthrough. Yet existing group-based policy optimization methods rely rigidly on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation, which dynamically adapts gradient intensity based on episode-level difficulty, and proximity-based soft aggregation, which derives baselines through continuous semantic weighting at the step level. Extensive evaluations on the ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines at negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, enabling immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at: \href{https://anonymous.4open.science/r/proxmo-B7E7/README.md}{https://anonymous.4open.science/r/proxmo}.
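To make the two mechanisms concrete, the sketch below illustrates the general shape of (a) success-rate-aware modulation, which scales group-relative advantages by a difficulty weight, and (b) proximity-based soft aggregation, which replaces a discrete-batch mean with a similarity-weighted baseline over all steps. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the inverse-square-root difficulty weight, the squared-distance similarity, and the softmax temperature are all assumptions standing in for whatever ProxMO uses.

```python
import math

def modulated_advantages(rewards, success_rate, eps=1e-6):
    """Success-rate-aware modulation (illustrative sketch).

    Deviations from the group mean are amplified on hard tasks (low
    success rate) and damped on easy ones, so a rare success on a hard
    task contributes a larger gradient than routine noise. The
    inverse-square-root weight is a hypothetical stand-in for the
    paper's modulation function.
    """
    mean = sum(rewards) / len(rewards)
    weight = 1.0 / math.sqrt(success_rate + eps)  # hypothetical choice
    return [weight * (r - mean) for r in rewards]

def proximity_baseline(step_embeddings, step_rewards, query_embedding, tau=1.0):
    """Proximity-based soft aggregation (illustrative sketch).

    Instead of averaging rewards within a discrete batch, the baseline
    for a query step is a softmax-weighted mean of rewards over all
    steps in the global context, weighted by semantic similarity (here,
    negative squared distance between stand-in embedding vectors).
    """
    sims = [-sum((q - e) ** 2 for q, e in zip(query_embedding, emb))
            for emb in step_embeddings]
    m = max(sims)  # subtract max for numerical stability
    ws = [math.exp((s - m) / tau) for s in sims]
    z = sum(ws)
    return sum(w * r for w, r in zip(ws, step_rewards)) / z
```

A step semantically close to high-reward steps thus receives a higher baseline (and a smaller advantage) than a hard batch-mean would assign, which is the soft-aggregation behavior the abstract describes; both functions are drop-in transformations of the quantities a standard GRPO loop already computes.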