Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Test-time scaling can improve the reasoning performance of large language models, but it incurs substantial computational overhead and latency. Existing adaptive sampling methods often rely on heuristic rules or distributional assumptions, which makes it hard to balance accuracy against cost efficiently. This work formulates adaptive sampling as a Markov decision process and introduces a lightweight reinforcement learning-based controller that operates solely on final-answer statistics, so it can be trained and deployed entirely on CPU. The resulting framework can be interpreted as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. In experiments against strong baselines such as ASC and ESC, the proposed method achieves a better trade-off among accuracy, number of sampling rounds, and total sample count.
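
To make the Lagrangian interpretation concrete, here is one way the constrained problem and its relaxation could be written. This is a sketch under assumed notation, not the paper's own formulation: the policy $\pi$, the number of rounds $T$, the total sample count $N$, the budgets $B_T$ and $B_N$, and the multipliers $\lambda_T, \lambda_N \ge 0$ are all symbols introduced here for illustration.

```latex
% Constrained problem: maximize expected correctness under round and sample budgets.
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\mathbf{1}\{\hat{a} = a^{*}\}\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}[T] \le B_T, \qquad \mathbb{E}_{\pi}[N] \le B_N

% Lagrangian relaxation: the budget constraints become per-episode penalties.
\mathcal{L}(\pi, \lambda_T, \lambda_N)
= \mathbb{E}_{\pi}\!\left[\mathbf{1}\{\hat{a} = a^{*}\} - \lambda_T\, T - \lambda_N\, N\right]
+ \lambda_T B_T + \lambda_N B_N
```

For fixed multipliers the budget terms are constants, so maximizing $\mathcal{L}$ over $\pi$ amounts to an unconstrained RL problem whose episode reward is correctness minus a weighted penalty on rounds (latency) and samples (computation).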

📝 Abstract

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
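
The round-by-round control loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the state features, the `threshold_policy` stand-in (a fixed rule used in place of a trained RL controller), the `sample_fn` callable, the batch size, and the penalty weights are all assumptions made for this example.

```python
from collections import Counter
from typing import Callable, List, Tuple

State = Tuple[float, float, int, int]  # (top share, margin, total samples, round)


def answer_stats(answers: List[str], round_idx: int) -> State:
    """Summarize final answers only; no model internals are needed."""
    ranked = Counter(answers).most_common(2)
    total = len(answers)
    top_share = ranked[0][1] / total
    second_share = ranked[1][1] / total if len(ranked) > 1 else 0.0
    return (top_share, top_share - second_share, total, round_idx)


def threshold_policy(state: State) -> bool:
    """Hypothetical stand-in for the learned controller: True means stop."""
    _, margin, _, _ = state
    return margin >= 0.5


def adaptive_sample(
    sample_fn: Callable[[int], List[str]],  # assumed: returns n final answers
    policy: Callable[[State], bool],
    batch_size: int = 4,
    max_rounds: int = 5,
) -> Tuple[str, int, int]:
    """Sample in rounds until the policy stops or the round cap is reached."""
    answers: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        answers.extend(sample_fn(batch_size))
        if policy(answer_stats(answers, rounds)):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def episode_reward(correct: bool, rounds: int, samples: int,
                   lam_rounds: float = 0.05, lam_samples: float = 0.01) -> float:
    """Assumed reward shape: correctness minus latency and compute penalties."""
    return float(correct) - lam_rounds * rounds - lam_samples * samples
```

In an RL setup, `threshold_policy` would be replaced by a small trained policy over the same statistics, and `episode_reward` would be the training signal. Because the state is built from answer counts alone, both training and inference of such a controller can plausibly run on CPU, as the abstract states.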

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

adaptive sampling

computation cost

latency

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning

adaptive sampling

test-time scaling