MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Parallel test-time scaling improves the reasoning accuracy of large language models, but it is computationally expensive because every sampled reasoning trace must run to completion before the majority vote. This work proposes MARS, a risk-controlled early-stopping rule that probes partial traces at intermediate checkpoints and estimates which active traces are likely to change their answers. It separates two sources of uncertainty: a five-feature logistic regression model predicts trace-level switch probabilities, while an adversarial bound calibrated from warmup traces covers where switching traces land. Generation stops once the current leading answer remains safe under this conservative bound. Across three reasoning models and three competition-math benchmarks, MARS reduces token consumption by 25–47% relative to full-budget self-consistency and by a further 14–29% on top of the strong DeepConf Online baseline, while matching the accuracy of the corresponding full-budget baselines.

📝 Abstract

Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
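The stopping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `slack` parameter, and the use of expected switch counts are assumptions made for illustration. The sketch takes the current answer of each active trace and a per-trace switch probability, then assumes an adversary routes every switching trace to whichever challenger hurts the leader most. The paper's calibrated high-probability bound is replaced here by a plain threshold on that pessimistic expected margin.

```python
from collections import Counter, defaultdict


def worst_case_margin(answers, switch_probs):
    """Return (leader, pessimistic expected margin of the leader).

    answers:      current answer probed from each active trace
    switch_probs: estimated probability that each trace changes its answer
    Both inputs are hypothetical stand-ins for the quantities MARS estimates.
    """
    if not answers or len(answers) != len(switch_probs):
        raise ValueError("need one switch probability per answer")

    counts = Counter(answers)
    leader, leader_votes = counts.most_common(1)[0]

    # Expected number of switching traces, grouped by current answer.
    expected_switches = defaultdict(float)
    for answer, p in zip(answers, switch_probs):
        expected_switches[answer] += p
    total_switches = sum(switch_probs)

    # Votes the leader is expected to retain.
    leader_kept = leader_votes - expected_switches[leader]

    # If every trace agrees, the adversary invents a fresh answer (None).
    challengers = [a for a in counts if a != leader] or [None]

    worst = float("inf")
    for challenger in challengers:
        # Adversarial assumption: the challenger keeps all of its own votes
        # and receives every switching trace that starts elsewhere.
        gained = total_switches - expected_switches.get(challenger, 0.0)
        challenger_votes = counts.get(challenger, 0) + gained
        worst = min(worst, leader_kept - challenger_votes)
    return leader, worst


def should_stop(answers, switch_probs, slack=0.0):
    """Stop when the leader stays ahead even under the adversarial assumption.

    `slack` is an assumed safety threshold standing in for a calibrated bound.
    """
    leader, margin = worst_case_margin(answers, switch_probs)
    return margin > slack, leader
```

For example, with six traces currently answering "42" (switch probability 0.05 each) and two answering "17" (0.5 each), the pessimistic margin is about 3.4, so `should_stop` returns `(True, "42")`. With a 2-2-1 split and switch probabilities of 0.4, the margin is negative and generation would continue.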

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

early stopping

computational overhead

majority voting

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling

early stopping

majority voting