Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the instability and irreproducibility of multi-step reasoning by large language models (LLMs) in non-serializable reinforcement learning environments, such as Docker containers, this paper proposes a lightweight guided search strategy: one-step lookahead combined with trajectory selection, both grounded in a learned action-value function. Unlike traditional search paradigms such as Monte Carlo Tree Search (MCTS), the method never saves or restores environment state, eliminating reliance on snapshots and thereby improving trial efficiency and environmental compatibility. The paper presents this as the first practical deployment of guided search in non-serializable RL settings. Evaluated on the SWE-bench Verified benchmark, a fine-tuned Qwen-72B model achieves a 40.8% task-solving rate, a new state of the art among open-weights models. Further validation with GPT-4o confirms cross-model generalizability and consistent performance gains.

📝 Abstract
Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrow the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g. MCTS) are often unsuitable for non-serializable RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a fine-tuned Qwen-72B model, achieving 40.8%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM consistency in multi-step tasks
Enabling guided search in non-serializable environments
Boosting software engineering agent performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

1-step lookahead search strategy
Trajectory selection guided by Q-function
Action-value function estimator guidance
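The two strategies listed above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: `q_value`, `propose_actions`, and `run_trajectory` are hypothetical placeholders, and the learned LLM-based action-value estimator is mocked with a trivial deterministic heuristic so the sketch is runnable.

```python
def q_value(state, action):
    """Stand-in for the learned action-value function Q(s, a).
    The real estimator is a fine-tuned model; this heuristic is a mock."""
    return len(str(state)) + len(str(action))

def one_step_lookahead(state, propose_actions, k=4):
    """Sample k candidate actions and commit to the one the estimator
    scores highest. Candidates are only scored, never executed, so the
    non-serializable environment (e.g. a Docker container) advances
    exactly once per step and no snapshot/restore is needed."""
    candidates = propose_actions(state, k)
    return max(candidates, key=lambda a: q_value(state, a))

def trajectory_selection(run_trajectory, n=4):
    """Roll out n independent full trajectories and return the one whose
    final state the estimator values most."""
    trajectories = [run_trajectory() for _ in range(n)]
    return max(trajectories, key=lambda t: q_value(t[-1], None))
```

The design point both sketches share is that the value function is only ever queried, so neither strategy requires checkpointing intermediate environment states, which is exactly what MCTS-style search would need.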
Karina Zainullina
Nebius
Alexander Golubev
Nebius
Maria Trofimova
Nebius AI
Sergei Polezhaev
Nebius
Ibragim Badertdinov
Nebius
Daria Litvintseva
Nebius
Simon Karasik
Nebius
Filipp Fisin
Nebius
Sergei Skvortsov
Nebius
Maksim Nekrashevich
Nebius
Anton Shevtsov
Nebius
Boris Yangel
Nebius