🤖 AI Summary
Large language models (LLMs) struggle to simultaneously achieve high accuracy and computational efficiency in complex reasoning tasks.
Method: This paper proposes Solve-Detect-Verify, a novel inference-time scaling framework that dynamically identifies solution completion points during reasoning (Detect) and then invokes a generative verifier, FlexiVe, with a tunable verification budget (Verify). FlexiVe combines error localization with adaptive computation budgeting, avoiding the prohibitive cost of naively integrating generative reward models (GenRMs) at test time.
Contribution/Results: The work introduces the first flexible verification budgeting mechanism, letting “fast thinking” and “slow verification” work in concert; it departs from conventional post-hoc verification in favor of proactive, timing-aware process supervision. Experiments show superior error-localization accuracy on ProcessBench, and consistent gains in both accuracy and inference efficiency over strong baselines, including self-consistency, on the AIME 2024, AIME 2025, and CNMO mathematical benchmarks.
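The Solve-Detect-Verify loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the solver/verifier interfaces, the completion check, and the budget-as-repeated-passes heuristic (`ToySolver`, `ToyVerifier`, `looks_complete`) are all hypothetical stand-ins for the real LLM solver and the FlexiVe verifier.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    feedback: str = ""

class ToySolver:
    """Hypothetical stand-in for the LLM solver: emits fixed reasoning steps."""
    def __init__(self, steps):
        self.steps, self.i = steps, 0

    def generate_step(self, problem, trace):
        step = self.steps[self.i % len(self.steps)]
        self.i += 1
        return step

    def looks_complete(self, trace):
        # Detect: a cheap completion heuristic, e.g. the trace ends in an answer line.
        return trace[-1].startswith("ANSWER:")

class ToyVerifier:
    """Hypothetical stand-in for FlexiVe: a larger budget buys more verification passes."""
    def verify(self, problem, trace, budget):
        votes = sum(1 for _ in range(budget) if "ANSWER: 4" in trace[-1])
        return Verdict(ok=votes > budget // 2,
                       feedback="final answer disagrees with a recomputation")

def solve_detect_verify(problem, solver, verifier, max_steps=16, budget=3):
    """Solve step by step; when a solution looks complete, spend the
    verification budget on it; on rejection, feed back localized feedback."""
    trace = []
    for _ in range(max_steps):
        trace.append(solver.generate_step(problem, trace))     # Solve
        if solver.looks_complete(trace):                       # Detect
            verdict = verifier.verify(problem, trace, budget)  # Verify
            if verdict.ok:
                return trace
            trace.append(f"[feedback] {verdict.feedback}")     # resume solving
    return trace
```

In this toy run, `solve_detect_verify("2+2?", ToySolver(["compute 2+2", "ANSWER: 4"]), ToyVerifier())` verifies only once the "ANSWER:" line appears, illustrating how targeted, completion-triggered verification avoids paying the verifier's cost on every intermediate step.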
📝 Abstract
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.