Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

📅 2026-06-03

🤖 AI Summary
This work addresses step-by-step optimization-like reasoning in real-world decision-making tasks, where a high-value plan must be found among many feasible alternatives. It introduces OPT*, a scalable family of tasks in which each task supplies a feasibility checker and an evaluator, and a complexity parameter expands the search space without requiring additional human annotations. Theoretically, the work relates success in large search spaces to the information a reasoner extracts per unit of search budget. Methodologically, it studies two training regimes: solver-guided online policy optimization with rank-based reward shaping, and search-based offline reinforcement learning for cases where no solver is available. Empirically, the authors ablate the ingredients that make search efficient on OPT* and report that training on OPT* improves large language models’ step-by-step optimization-like reasoning.
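To make the task structure concrete, here is a minimal sketch of what a task with a feasibility checker, an evaluator, and a complexity parameter might look like. It is a toy knapsack-style problem chosen purely for illustration and is not one of the paper's OPT* tasks; the names `ToyTask`, `make_task`, `is_feasible`, and `evaluate` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ToyTask:
    """Hypothetical knapsack-style task (illustrative, not from the paper)."""
    weights: Tuple[int, ...]
    values: Tuple[int, ...]
    capacity: int


def make_task(complexity: int, seed: int = 0) -> ToyTask:
    """Generate a task with `complexity` items.

    Assumption: the complexity parameter is the item count, so the number
    of candidate plans (subsets) grows as 2**complexity with no new labels.
    """
    rng = random.Random(seed)
    weights = tuple(rng.randint(1, 10) for _ in range(complexity))
    values = tuple(rng.randint(1, 20) for _ in range(complexity))
    capacity = max(1, sum(weights) // 2)
    return ToyTask(weights, values, capacity)


def is_feasible(task: ToyTask, plan: Sequence[int]) -> bool:
    """Feasibility checker: distinct, in-range item indices within capacity."""
    if len(set(plan)) != len(plan):
        return False
    if any(i < 0 or i >= len(task.weights) for i in plan):
        return False
    return sum(task.weights[i] for i in plan) <= task.capacity


def evaluate(task: ToyTask, plan: Sequence[int]) -> int:
    """Evaluator: total value of the selected items (higher is better)."""
    return sum(task.values[i] for i in plan)
```

In this sketch, many plans pass `is_feasible`, and `evaluate` ranks them, which mirrors the "high-value feasible plan among many valid alternatives" setting the summary describes.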
📝 Abstract
Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.
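One way to picture rank-based reward shaping is sketched below. The abstract does not give the paper's formula, so this is an assumed variant: candidate next steps are scored by a solver acting as a value oracle, and each candidate's reward depends only on its rank among the candidates, not on the raw oracle values. The function name `rank_rewards` and the linear rank-to-reward mapping are hypothetical.

```python
from typing import List, Sequence


def rank_rewards(oracle_values: Sequence[float]) -> List[float]:
    """Map oracle values of candidate next steps to rank-based rewards.

    Assumed scheme (not taken from the paper): the best candidate gets 1.0,
    the worst gets 0.0, and ranks in between are spaced linearly. Candidates
    with equal oracle values share a rank and therefore a reward.
    """
    if not oracle_values:
        return []
    # Distinct values, best first; index in this list is the (dense) rank.
    distinct = sorted(set(oracle_values), reverse=True)
    if len(distinct) == 1:
        # Assumption: when all candidates tie, all count as best.
        return [1.0] * len(oracle_values)
    rank = {value: r for r, value in enumerate(distinct)}
    worst_rank = len(distinct) - 1
    return [1.0 - rank[v] / worst_rank for v in oracle_values]


# Rewards depend on ordering only, so rescaling oracle values changes nothing.
print(rank_rewards([10, 30, 20]))   # [0.0, 1.0, 0.5]
print(rank_rewards([1, 1000, 2]))   # [0.0, 1.0, 0.5]
```

Using ranks rather than raw values keeps the reward scale comparable across tasks whose evaluators return very different magnitudes, which is one plausible reason to prefer this kind of shaping when the complexity parameter changes the range of achievable values.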
Problem

Research questions and friction points this paper is trying to address.

step-by-step reasoning
optimization-like reasoning
search space expansion
feasible plan selection
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-by-step optimization
search space expansion
solver-guided policy optimization
offline reinforcement learning
reward shaping