CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reasoning tasks, test-time scaling methods (e.g., Best-of-N) suffer from diminishing returns. This paper proposes a training-free test-time calibration framework that dynamically adjusts the logit distribution via an input-dependent temperature coefficient (T) and offset vector (delta), thereby increasing the probability of high-reward reasoning paths without modifying model parameters. The method is agnostic to the downstream decoding strategy and comes with a theoretical guarantee: an improved lower bound on the expected reward under finite sampling. Evaluated on MATH-500 and AIME-2024, it reaches the same accuracy with up to 4× fewer rollouts, or significantly higher accuracy under a fixed computational budget. Its core contribution is a provably effective, lightweight, decoding-strategy-orthogonal dynamic logit calibration mechanism.
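The calibration itself is simple to state: shift the logits by delta, rescale by T, and renormalize. A minimal sketch of that operation (the function name and signature are illustrative, not the paper's code):

```python
import numpy as np

def calibrate_logits(logits, T=1.0, delta=None):
    """Apply an input-dependent temperature T and additive shift delta
    to a logit vector, then renormalize with softmax.

    Illustrative only: in CarBoN, T and delta are learned per input;
    here they are passed in directly.
    """
    z = np.asarray(logits, dtype=float)
    if delta is not None:
        z = z + np.asarray(delta, dtype=float)
    z = z / T
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# T > 1 flattens the distribution (more output diversity);
# T < 1 sharpens it; delta moves probability mass toward specific tokens.
probs = calibrate_logits([2.0, 1.0, 0.0], T=0.5, delta=[0.0, 0.5, 0.0])
```

This matches the paper's observation that T and delta play complementary roles: temperature trades off diversity against concentration, while the shift vector redirects mass toward more reliable tokens.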

📝 Abstract
Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-$N$ sampling often show diminishing returns as $N$ increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-$N$), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature $T$ and additive shift vector $\delta$, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of $T$ and $\delta$ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.
Problem

Research questions and friction points this paper is trying to address.

Improves reasoning efficiency via calibrated sampling
Addresses diminishing returns in Best-of-N sampling
Enhances test-time reasoning without LLM retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibrated Best-of-N sampling adapts model toward high-reward reasoning paths
Learns input-specific temperature and shift vector for logit calibration
Improves efficiency with fewer rollouts while maintaining accuracy
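The two-phase idea can be sketched on a toy problem. Below, a handful of candidate paths stand in for phase-1 rollouts (base-model logits plus reward-model scores), and phase 2 grid-searches a temperature and a reward-proportional shift that raise the expected Best-of-N reward. All numbers and the shift rule are hypothetical stand-ins, not the paper's learning procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy phase-1 data: 5 candidate reasoning paths with base-model logits
# and oracle rewards. The highest-reward path is under-weighted.
logits = np.array([3.0, 2.5, 1.0, 0.5, 0.0])
rewards = np.array([0.1, 0.2, 0.9, 0.3, 0.0])

def expected_best_of_n_reward(p, n=4, trials=2000):
    """Monte-Carlo estimate of E[max reward] over n i.i.d. draws from p."""
    draws = rng.choice(len(p), size=(trials, n), p=p)
    return rewards[draws].max(axis=1).mean()

# Phase 2: search a temperature T and shift delta toward high-reward paths.
baseline = expected_best_of_n_reward(softmax(logits))
best_T, best_score = 1.0, baseline
for T in (0.5, 1.0, 2.0, 4.0):
    delta = 0.5 * rewards            # hypothetical reward-proportional shift
    p = softmax((logits + delta) / T)
    score = expected_best_of_n_reward(p)
    if score > best_score:
        best_T, best_score = T, score
```

In this toy setup, calibration helps by flattening and shifting the sampling distribution so that N draws are more likely to include the high-reward path, which mirrors why Best-of-N alone sees diminishing returns when the base distribution rarely visits it.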