Probabilistic Optimality for Inference-time Scaling

📅 2025-06-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
While parallel sampling strategies (e.g., Best-of-N) are widely adopted at inference time to enhance large language model (LLM) performance, they lack theoretical grounding, and the number of samples is typically chosen empirically. Method: This paper establishes the first probabilistic optimality framework for inference-time scaling, deriving a fundamental performance lower bound under independent and identically distributed (i.i.d.) sampling. Building on this, we propose OptScale, a principled algorithm that dynamically determines the minimal sample size required to meet user-specified performance and confidence thresholds, leveraging the LLM's own predicted prior parameters. Contribution/Results: OptScale bridges theoretical rigor with computational efficiency. Evaluated on multiple mathematical reasoning benchmarks, it reduces sampling overhead by 37%–62% on average while matching or surpassing state-of-the-art performance.

πŸ“ Abstract
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and that the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the number of samples required to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language-model-based predictor to estimate probabilistic prior parameters, enabling it to decide the minimal number of samples that satisfies predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while performing on par with or better than state-of-the-art reasoning baselines. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning.
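The flavor of the paper's lower bound can be illustrated with a simple i.i.d. Best-of-N calculation (a minimal sketch under textbook assumptions, not the paper's actual derivation): if each independent sample is correct with probability p, the probability that at least one of N samples is correct is 1 - (1 - p)^N, so the smallest N reaching confidence c is ⌈log(1 - c) / log(1 - p)⌉. The function name and interface below are hypothetical.

```python
import math

def min_samples(p_correct: float, confidence: float) -> int:
    """Smallest N such that at least one of N i.i.d. samples is correct
    with probability >= confidence, i.e. 1 - (1 - p)^N >= confidence.

    Illustrative sketch only: assumes a known per-sample success
    probability and a perfect Best-of-N verifier, which the paper's
    framework replaces with estimated prior parameters.
    """
    if not (0.0 < p_correct < 1.0) or not (0.0 < confidence < 1.0):
        raise ValueError("p_correct and confidence must lie in (0, 1)")
    # Solve 1 - (1 - p)^N >= c for N; both logs are negative, so the
    # inequality flips once when dividing.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_correct))

# Example: with a 30% per-sample success rate, reaching 95% confidence
# requires 9 samples, since 1 - 0.7^9 ≈ 0.96 but 1 - 0.7^8 ≈ 0.94.
print(min_samples(0.3, 0.95))
```

The appeal of such a rule is that N adapts to problem difficulty: easy prompts (high p) need few samples, while hard prompts justify a larger budget, which is the intuition behind dynamically choosing the sample size rather than fixing N empirically.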
Problem

Research questions and friction points this paper is trying to address.

Parallel sampling strategies such as Best-of-N lack theoretical grounding
The number of samples is typically chosen empirically, wasting compute
No principled guidance exists for compute-efficient inference-time scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic optimality framework for inference-time scaling
Theoretical lower bound on the sample count needed for a target performance level
OptScale: dynamically determines the minimal sample size meeting performance and confidence thresholds