🤖 AI Summary
To address the inconsistent performance of large language models (LLMs) on multi-step reasoning tasks in non-serializable reinforcement learning environments, such as Docker containers, this paper proposes a lightweight guided-search strategy: one-step lookahead combined with trajectory selection, both grounded in a learned action-value function. Unlike traditional search paradigms such as Monte Carlo Tree Search (MCTS), the method never saves or restores environment state, eliminating the reliance on snapshots and broadening the range of environments it applies to. Evaluated on the SWE-bench Verified benchmark, the approach doubles the average success rate of a fine-tuned Qwen-72B model, reaching 40.8% and establishing a new state of the art among open-weights models. Further validation with GPT-4o confirms cross-model transferability and consistent performance gains.
📝 Abstract
Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrowing the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g., MCTS) are often unsuitable for non-serializable RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find that these methods double the average success rate of a fine-tuned Qwen-72B model, achieving 40.8%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
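The two strategies described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the policy sampler, the action-value estimator `q_value`, and the environment `step_fn` are all stand-ins, and the real system would back them with an LLM policy and a learned Q model. The key property it illustrates is that neither strategy ever rolls the environment back, so no snapshots are needed.

```python
# Hypothetical sketch of 1-step lookahead and trajectory selection guided
# by an action-value function Q(s, a). All components below are stubs.

def propose_actions(state, k):
    # Stub: a real agent would sample k candidate actions from the LLM policy.
    return [f"{state}->a{i}" for i in range(k)]

def q_value(state, action):
    # Stub: a real system uses a learned estimator; here, a deterministic score.
    return (hash((state, action)) % 997) / 997.0

def one_step_lookahead(state, k=4):
    """Score each candidate action with Q and execute only the best one.
    Candidates are ranked without being executed, so the environment is
    never saved or restored."""
    return max(propose_actions(state, k), key=lambda a: q_value(state, a))

def run_episode(step_fn, state, horizon, k=4):
    """Roll out one trajectory, choosing each action via 1-step lookahead."""
    trajectory = []
    for _ in range(horizon):
        action = one_step_lookahead(state, k)
        trajectory.append((state, action))
        state = step_fn(state, action)  # each environment runs forward only
    return trajectory

def select_trajectory(trajectories):
    """Trajectory selection: among independent rollouts (e.g., one per fresh
    container), keep the one whose final step Q rates highest."""
    return max(trajectories, key=lambda t: q_value(*t[-1]))
```

In a non-serializable setting, the independent rollouts passed to `select_trajectory` would each run in their own fresh container, since a single environment cannot be forked or rewound.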