Trust but Verify! A Survey on Verification Design for Test-time Scaling

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: The test-time scaling (TTS) field lacks a systematic survey, unified taxonomy, and principled analysis of verification methods. Method: We propose the first taxonomy framework for TTS verifiers, structured along three dimensions—verifier type, training paradigm (prompt-based guidance, discriminative/generative fine-tuning), and application mode—and integrate search-space exploration with candidate-output scoring for efficient inference optimization. We further provide a comprehensive survey of existing verification techniques and release an open-source verification resource repository. Contribution/Results: Our framework fills a critical gap in systematic TTS verification research, significantly improving the accuracy and reliability of large language model inference. It establishes a standardized foundation and reproducible benchmark for verifier design in TTS, enabling principled development and evaluation of verification mechanisms.
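The three taxonomy dimensions named above can be sketched as a small data structure. This is a minimal illustration only; the field names and example values are assumptions, not the paper's exhaustive categories.

```python
from dataclasses import dataclass

@dataclass
class VerifierSpec:
    """Hypothetical record for the survey's three taxonomy dimensions."""
    verifier_type: str  # e.g. "process" (scores reasoning steps) or "outcome" (scores final answers)
    training: str       # e.g. "prompt-based", "discriminative-ft", "generative-ft"
    application: str    # e.g. "best-of-n" or "search-guidance"

# Example classification of one verifier under this scheme:
spec = VerifierSpec(verifier_type="outcome",
                    training="discriminative-ft",
                    application="best-of-n")
```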

📝 Abstract
Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. By using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS, such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. Verifiers serve as reward models that score the candidate outputs from the decoding process, diligently exploring the vast solution space to select the best outcome. This paradigm has emerged as a superior approach owing to parameter-free scaling at inference time and high performance gains. Verifiers may be prompt-based or fine-tuned as discriminative or generative models to verify process paths, outcomes, or both. Despite their widespread adoption, there is no detailed collection, clear categorization, and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types, and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
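The verifier-guided selection the abstract describes can be sketched as best-of-N sampling: draw several candidates, score each with a verifier, and keep the highest-scoring one. A minimal sketch, assuming hypothetical `generate` and `verify` callables (the paper covers many richer search strategies):

```python
import random

def best_of_n(generate, verify, prompt, n=4):
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins for an LLM sampler and a learned verifier (illustrative only):
def toy_generate(prompt):
    return random.choice(["answer A", "answer B", "answer C"])

def toy_verify(prompt, candidate):
    # A real verifier would be a prompt-based, discriminative, or
    # generative reward model; this toy one simply prefers "answer B".
    return 1.0 if candidate == "answer B" else 0.0
```

Increasing `n` spends more inference-time compute for a better chance of surfacing a high-scoring candidate, which is the scaling knob TTS turns without touching model parameters.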
Problem

Research questions and friction points this paper is trying to address.

Surveying verification design approaches for test-time scaling
Categorizing verifier training mechanisms and types
Analyzing utility of verifiers in LLM inference optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier models score candidate outputs
Parameter-free scaling during inference
Unified view of verifier training types