SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling

📅 2025-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) suffer from low test-time compute efficiency, diminishing returns under scaling, and reliance on costly, task-specific reward models for complex reasoning tasks. To address these limitations, the paper proposes Self-Enhanced Test-Time Scaling (SETS), a test-time scaling framework that integrates self-verification and self-correction without requiring any external reward model. The method operates via a three-stage mechanism of sampling, verification, and correction, augmented by dynamic path selection and iterative refinement to improve reasoning efficiency. Empirically, it achieves substantial improvements on planning and multi-step reasoning benchmarks, outperforming majority voting and reward-model-based baselines under identical compute budgets. Moreover, its scaling curve saturates markedly more slowly, demonstrating superior scalability and generalization across diverse reasoning tasks.
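The three-stage loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `sample_fn`, `verify_fn`, and `correct_fn` stand in for hypothetical LLM calls (generate a candidate, self-judge it, revise it using the self-critique), and aggregating the refined samples by majority vote is an assumption for this sketch rather than the paper's exact selection rule.

```python
from collections import Counter

def sets_inference(prompt, sample_fn, verify_fn, correct_fn,
                   num_samples=4, max_rounds=3):
    """Sketch of a sample -> self-verify -> self-correct loop over LLM calls.

    sample_fn(prompt)           -> candidate answer (hypothetical LLM call)
    verify_fn(prompt, answer)   -> bool, the model judging its own answer
    correct_fn(prompt, answer)  -> revised answer based on the self-critique
    """
    refined = []
    for _ in range(num_samples):
        answer = sample_fn(prompt)
        # Iteratively self-correct until the model accepts its own answer
        # or the round budget is exhausted.
        for _ in range(max_rounds):
            if verify_fn(prompt, answer):
                break
            answer = correct_fn(prompt, answer)
        refined.append(answer)
    # Aggregate the refined samples (majority vote, as an assumed rule).
    return Counter(refined).most_common(1)[0][0]
```

With deterministic stubs in place of the LLM calls, the loop drives every sample toward the verified answer before voting, which is the mechanism the summary credits for the reduced saturation of the scaling curve.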

📝 Abstract
Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, conventional approaches, such as repeated sampling with majority voting or reward model scoring, often face diminishing returns as test-time compute scales, in addition to requiring costly, task-specific reward model training. In this paper, we present Self-Enhanced Test-Time Scaling (SETS), a novel method that leverages the self-verification and self-correction capabilities of recent advanced LLMs to overcome these limitations. SETS integrates sampling, self-verification, and self-correction into a unified framework, enabling efficient and scalable test-time computation for improved performance on complex tasks. Through extensive experiments on challenging planning and reasoning benchmarks, we demonstrate that SETS achieves significant performance improvements and more favorable test-time scaling laws than the alternatives.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Complex Reasoning Tasks
Reward Model Training Costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Enhanced Test-Time Scaling (SETS)
Self-Verification and Self-Correction
Enhanced Performance on Complex Tasks
Jiefeng Chen
Google Cloud AI Research
Jie Ren
Google DeepMind
Xinyun Chen
Google DeepMind
Chengrun Yang
Research Scientist, Google DeepMind
Machine Learning, Optimization, Large Language Models
Ruoxi Sun
Google Cloud AI Research
Sercan Ö. Arik
Google Cloud AI Research