ATLAS: Agentic Test-time Learning-to-Allocate Scaling

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional test-time scaling methods: they rely on static, designer-chosen strategies (a fixed sample budget, refinement loop, scoring rule, or search policy) and give the model no say in how compute is spent. The authors propose ATLAS, a framework that hands control of test-time scaling to a large language model (LLM) orchestrator, which coordinates multiple solvers end to end. The orchestrator has a single action, “explore”, which dispatches a fresh independent solver on the original problem; by choosing whether to call it again, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer, keeping the accumulated evidence in its own state. The action space is extensible: each explore call can optionally specify the solver, reasoning effort, or prompting strategy, and a multi-model variant, ATLAS-MM, exposes solver choice as an additional action dimension. With a Claude Sonnet 4.6 backbone, ATLAS reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. ATLAS-MM further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%.
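For contrast, the kind of static strategy this summary refers to can be sketched as fixed-budget majority voting. This is only an illustrative baseline, not code from the paper: the function name `fixed_budget_majority_vote`, the `solver` callable, and the default budget of 5 are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable


def fixed_budget_majority_vote(
    problem: str,
    solver: Callable[[str], str],  # hypothetical: maps a problem to one candidate answer
    n_samples: int = 5,            # assumed budget, fixed by the designer in advance
) -> tuple[str, int]:
    """Sample a fixed number of answers and return the most common one.

    The budget and the scoring rule (majority vote) are both hard-coded:
    nothing the solver returns can change how much compute is spent.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [solver(problem) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, n_samples  # the number of solver calls always equals the budget
```

Even if the first few samples already agree, this routine spends the whole budget, which is the rigidity an orchestrator-driven loop is meant to remove.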

📝 Abstract

Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.
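The control loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: in ATLAS the decisions to explore, stop, and synthesize are made by an LLM orchestrator, whereas here a toy rule (`agreement_policy`) and a majority vote (`synthesize`) stand in for it. All names (`OrchestratorState`, `explore`, `run_orchestrator`, `max_calls`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Evidence:
    solver: str  # which solver produced this answer
    answer: str  # the candidate answer it returned


@dataclass
class OrchestratorState:
    problem: str
    evidence: list[Evidence] = field(default_factory=list)  # accumulated across calls


Solver = Callable[[str], str]
# A policy returns the name of the solver to dispatch next, or None to stop.
Policy = Callable[[OrchestratorState], Optional[str]]


def explore(state: OrchestratorState, solvers: dict[str, Solver], name: str) -> None:
    """Dispatch a fresh solver on the ORIGINAL problem and record its answer."""
    state.evidence.append(Evidence(name, solvers[name](state.problem)))


def agreement_policy(state: OrchestratorState) -> Optional[str]:
    """Toy stand-in for the LLM's decision: stop once two answers agree."""
    counts = Counter(e.answer for e in state.evidence)
    if counts and counts.most_common(1)[0][1] >= 2:
        return None
    return "default"  # assumed solver name; a multi-model policy could pick others


def synthesize(state: OrchestratorState) -> Optional[str]:
    """Toy stand-in for LLM synthesis: return the most frequent answer."""
    if not state.evidence:
        return None
    return Counter(e.answer for e in state.evidence).most_common(1)[0][0]


def run_orchestrator(
    problem: str,
    solvers: dict[str, Solver],
    policy: Policy,
    max_calls: int = 8,  # assumed safety cap, not a value from the paper
) -> tuple[Optional[str], int]:
    state = OrchestratorState(problem)
    for _ in range(max_calls):
        choice = policy(state)  # the policy sees all evidence gathered so far
        if choice is None:
            break
        explore(state, solvers, choice)
    return synthesize(state), len(state.evidence)
```

Calling `run_orchestrator(problem, {"default": my_solver}, agreement_policy)` returns the chosen answer together with the number of solver calls actually made. Because the policy returns a solver name, the same loop also accommodates a multi-model setting in the spirit of ATLAS-MM by adding more entries to `solvers`.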

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

agentic orchestration

compute allocation

reasoning coordination

LLM control loop

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic test-time scaling

LLM orchestrator

dynamic compute allocation