🤖 AI Summary
Traditional large language models employ a static training paradigm, which struggles to address the diverse failure modes arising from dynamic bottlenecks and sparse rewards in reinforcement learning. This work proposes EvoTrainer, a novel framework that enables the first co-evolution of policy and training infrastructure. EvoTrainer identifies root causes of failure at the trajectory level, iteratively refines its diagnostic logic, validates interventions through backtesting, and constructs a reusable skill library across tasks. By integrating evolvable diagnostics with skill reuse mechanisms, the approach significantly enhances robustness and training efficiency in long-horizon tasks. Evaluated on mathematical reasoning, competitive programming, and warehouse-scale software engineering benchmarks, EvoTrainer matches or surpasses handcrafted reinforcement learning baselines, demonstrating particularly strong performance in extended-agent software engineering scenarios.
📝 Abstract
Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.