🤖 AI Summary
Current test-time verification mechanisms for large language models (LLMs) lack rigorous theoretical modeling; their efficacy is jointly governed by the generator's coverage, the verifier's region of convergence (ROC), and the sub-optimality of the sampling algorithm.
Method: This paper formulates test-time verification as an optimal transport problem, establishing a unified geometric analysis framework. It theoretically characterizes a three-phase sub-optimality versus coverage trade-off curve (transport, policy improvement, and saturation) and shows that the verifier's ROC determines whether sub-optimality decreases with coverage in the policy improvement phase. Building on this, it proposes and analyzes both sequential and batched sampling algorithms, examining how their computational complexity shapes verification performance.
Results: Experiments on Qwen, Llama, and Gemma empirically validate the predicted three-phase behavior and theoretical bounds. Our framework provides an interpretable, quantifiable foundation for trustworthy LLM reasoning, bridging theory and practice in test-time verification.
📝 Abstract
While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remains underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's coverage, (ii) the verifier's region of convergence (ROC), and (iii) the sampling algorithm's sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality--coverage curve exhibits three regimes: a transport regime, where sub-optimality increases with coverage; a policy improvement regime, where sub-optimality may decrease with coverage, depending on the verifier's ROC; and a saturation regime, where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms, sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.
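To make the sequential versus batched distinction concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm or its definition of ROC: the generator is reduced to a coin flip with success probability `p_correct`, and the imperfect verifier to two hypothetical acceptance rates, `tpr` (for correct answers) and `fpr` (for incorrect ones). All function and parameter names are assumptions made for illustration only.

```python
import random


def generate(rng, p_correct):
    """Toy generator (assumption): a candidate is correct with probability p_correct."""
    return rng.random() < p_correct


def verify(rng, is_correct, tpr, fpr):
    """Toy imperfect verifier (assumption): accepts a correct candidate with
    probability tpr and an incorrect candidate with probability fpr."""
    return rng.random() < (tpr if is_correct else fpr)


def sequential_sample(rng, p_correct, tpr, fpr, max_draws):
    """Draw candidates one at a time and stop at the first accepted one.
    Returns (candidate_is_correct, number_of_draws_used)."""
    candidate = False
    for draws in range(1, max_draws + 1):
        candidate = generate(rng, p_correct)
        if verify(rng, candidate, tpr, fpr):
            return candidate, draws
    # Budget exhausted: fall back to the last (rejected) candidate.
    return candidate, max_draws


def batched_sample(rng, p_correct, tpr, fpr, batch_size):
    """Draw a full batch, then return a uniformly random accepted candidate,
    or a uniformly random candidate if the verifier accepted none.
    Returns (candidate_is_correct, number_of_draws_used)."""
    candidates = [generate(rng, p_correct) for _ in range(batch_size)]
    accepted = [c for c in candidates if verify(rng, c, tpr, fpr)]
    pool = accepted if accepted else candidates
    return rng.choice(pool), batch_size


def estimate(sampler, trials=20000, seed=0, **kwargs):
    """Monte Carlo estimate of (accuracy, mean number of generator draws)."""
    rng = random.Random(seed)
    results = [sampler(rng, **kwargs) for _ in range(trials)]
    accuracy = sum(correct for correct, _ in results) / trials
    mean_draws = sum(draws for _, draws in results) / trials
    return accuracy, mean_draws


if __name__ == "__main__":
    settings = dict(p_correct=0.2, tpr=0.9, fpr=0.1)
    print("sequential:", estimate(sequential_sample, max_draws=8, **settings))
    print("batched:   ", estimate(batched_sample, batch_size=8, **settings))
```

In this toy setting the sequential sampler stops early and so uses fewer generator calls on average, while the batched sampler always spends its full budget. Both remain limited by the verifier's error rates no matter how many candidates are drawn, which is loosely analogous to the plateau described above.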