Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality

📅 2025-10-21

🤖 AI Summary

Problem: Current test-time verification mechanisms for large language models (LLMs) lack rigorous theoretical modeling, even though their efficacy is jointly governed by the generator's coverage, the verifier's region of convergence (ROC), and the sub-optimality of the sampling algorithm. Method: This paper formulates test-time verification as an optimal transport problem, giving a unified geometric framework for analysis. It characterizes a three-regime sub-optimality–coverage curve (transport, policy improvement, and saturation) and highlights the role of the verifier's ROC in determining whether sub-optimality can decrease in the policy improvement regime. Building on this, the paper designs and analyzes sequential and batched sampling algorithms and examines how their computational complexity shapes these trade-offs. Results: Experiments on Qwen, Llama, and Gemma models corroborate the predicted three-regime behavior and the theoretical findings. The framework offers an interpretable, quantitative foundation for analyzing test-time verification in LLMs.

📝 Abstract

While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's coverage, (ii) the verifier's region of convergence (ROC), and (iii) the sampling algorithm's sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This framing characterizes the interaction of coverage, ROC, and sub-optimality, and reveals that the sub-optimality–coverage curve exhibits three regimes: a transport regime, where sub-optimality increases with coverage; a policy improvement regime, where sub-optimality may decrease with coverage, depending on the verifier's ROC; and a saturation regime, where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms, sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.

Problem

Research questions and friction points this paper is trying to address.

Quantifying the interplay between generator coverage and verifier convergence

Analyzing how sampling sub-optimality affects verification performance regimes

Developing a transport framework for verifiable test-time scaling in LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frames verification as an optimal transport problem

Characterizes coverage, ROC, and sub-optimality interactions

Proposes sequential and batched sampling algorithms
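
To make the distinction between sequential and batched sampling concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: it assumes each response can be reduced to a correct/incorrect flag, that the generator's coverage is simply the probability of drawing a correct response, and that the imperfect verifier is described by two hypothetical acceptance probabilities (`p_accept_good`, `p_accept_bad`). All function and parameter names are illustrative.

```python
import random
from typing import Callable, Tuple

Generator = Callable[[], bool]      # returns True if the drawn response is correct
Verifier = Callable[[bool], bool]   # returns True if the response is accepted


def make_generator(coverage: float, rng: random.Random) -> Generator:
    """Toy generator (assumption): each draw is correct with probability `coverage`."""
    return lambda: rng.random() < coverage


def make_verifier(p_accept_good: float, p_accept_bad: float,
                  rng: random.Random) -> Verifier:
    """Toy imperfect verifier (assumption): accepts correct responses with
    probability `p_accept_good` and incorrect ones with `p_accept_bad`."""
    def verify(is_correct: bool) -> bool:
        p = p_accept_good if is_correct else p_accept_bad
        return rng.random() < p
    return verify


def sequential_sample(generate: Generator, verify: Verifier,
                      budget: int) -> Tuple[bool, int]:
    """Draw one response at a time and stop at the first accepted one.
    Falls back to the last draw if the budget runs out.
    Returns (is_correct, number_of_draws_used)."""
    last = False
    for n in range(1, budget + 1):
        last = generate()
        if verify(last):
            return last, n
    return last, budget


def batched_sample(generate: Generator, verify: Verifier,
                   batch_size: int, rng: random.Random) -> Tuple[bool, int]:
    """Draw a whole batch, then return a uniformly random accepted response,
    or a uniformly random response if the verifier accepts none.
    Returns (is_correct, number_of_draws_used)."""
    batch = [generate() for _ in range(batch_size)]
    accepted = [r for r in batch if verify(r)]
    pool = accepted or batch
    return rng.choice(pool), batch_size


def error_rate(sampler: Callable[[], Tuple[bool, int]], trials: int) -> float:
    """Fraction of trials in which the sampler returned an incorrect response."""
    wrong = sum(not sampler()[0] for _ in range(trials))
    return wrong / trials
```

A possible use is to sweep `coverage` for a fixed verifier and compare `error_rate(lambda: sequential_sample(gen, ver, budget=8), 10_000)` against the batched variant. The sequential sampler spends a variable number of draws, while the batched sampler always spends `batch_size`, which is the kind of compute trade-off the abstract refers to.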