🤖 AI Summary
This paper targets three problems that arise when large language models (LLMs) perform long-chain reasoning under speculative decoding: high synchronization overhead, severe memory bottlenecks, and substantial latency. It proposes A1 (Asynchronous Test-Time Scaling), the first framework to integrate conformal prediction with asynchronous test-time scaling. A1 introduces an online calibration mechanism and a three-stage rejection sampling strategy that enable statistically reliable, low-overhead asynchronous inference scheduling, and it supports both sequential and parallel scaling, removing the constraints of conventional synchronous paradigms. Using a refined arithmetic-intensity analysis that identifies synchronization as the dominant bottleneck, together with dynamic confidence control, A1 achieves up to 56.7× speedup and 4.14× throughput improvement on the MATH, AMC23, AIME24, and AIME25 mathematical reasoning benchmarks, while reducing latency and GPU memory consumption without compromising generation accuracy.
📝 Abstract
Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and incurring no accuracy loss compared with target-model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.
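To make "rejection-rate control" concrete, the following is a minimal sketch of generic split-conformal thresholding, not A1's actual algorithm, which the abstract does not specify. The names `conformal_threshold` and `OnlineCalibrator`, the sliding-window update, and the notion of a scalar nonconformity score per draft sample are all assumptions introduced for illustration. The idea is that a draft sample is rejected when its score exceeds a quantile of previously observed scores, chosen so that the rejection probability stays at or below a target rate `alpha` when calibration and test scores are exchangeable.

```python
import math
from collections import deque


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: under exchangeability, a fresh score
    exceeds the returned value with probability at most alpha."""
    n = len(scores)
    if n == 0:
        return math.inf  # no calibration data yet: accept everything
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return math.inf  # too few samples to certify the requested rate
    return sorted(scores)[k - 1]


class OnlineCalibrator:
    """Hypothetical online variant: recalibrates on a sliding window of
    recent scores. The window is a heuristic, so the bound on the
    rejection rate is only approximate in this setting."""

    def __init__(self, alpha, window=256):
        self.alpha = alpha
        self.scores = deque(maxlen=window)

    def should_reject(self, score):
        # Decide using only past scores, then record the new one.
        threshold = conformal_threshold(self.scores, self.alpha)
        reject = score > threshold
        self.scores.append(score)
        return reject
```

For example, with calibration scores `[3, 1, 2]` and `alpha = 0.5`, the rank is `ceil(4 * 0.5) = 2`, so the threshold is the second-smallest score, `2`. Because the decision in `should_reject` depends only on scores already seen, the threshold can be refreshed without blocking on a synchronous calibration pass, which is the property the sketch is meant to illustrate.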