🤖 AI Summary
This work addresses the inefficiency and lack of inference-time controllability in large language model (LLM) reasoning, which often stems from fixed computational pathways. The authors formulate reasoning steering as a Markov decision process and introduce a controller agent that, at each step of inference, selects a reasoning strategy and a steering phrase based on the current reasoning trace and the remaining thinking budget. The reasoner itself stays frozen, so the approach adaptively balances accuracy and efficiency while preserving the continuity of the reasoner's generation. To the best of the authors’ knowledge, it is the first method to employ an agent-based mechanism for dynamic, budget-aware control over chain-of-thought reasoning. Experiments across multiple benchmarks show that the proposed method substantially reduces token consumption while matching the performance of full-length reasoning, and that it supports controllable accuracy–efficiency trade-offs across different reasoners and tasks.
📝 Abstract
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, but leave how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process in which a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and the remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent on synthetic steering trajectories that we construct with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
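The steering loop described in the abstract can be pictured with a small sketch. It is not taken from the ACTS repository and should not be read as the authors' implementation: the names `State`, `SteeringAction`, and `steer`, the strategy labels, the whitespace token count, and the toy controller and reasoner are all illustrative assumptions. In the paper the controller is a trained agent and the reasoner is a frozen LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SteeringAction:
    strategy: str  # hypothetical label, e.g. "decompose", "verify", "conclude"
    phrase: str    # text that opens the next reasoner step


@dataclass
class State:
    question: str
    trace: List[str]        # reasoning steps produced so far
    remaining_budget: int   # thinking tokens still available


# Hypothetical interfaces: the controller maps a state to an action, and the
# reasoner continues the trace from the steering phrase.
Controller = Callable[[State], SteeringAction]
Reasoner = Callable[[str, List[str], str], str]


def count_tokens(text: str) -> int:
    # Whitespace split as a stand-in for a real tokenizer (assumption).
    return len(text.split())


def steer(question: str, controller: Controller, reasoner: Reasoner,
          budget: int, max_steps: int = 16) -> List[str]:
    """Run the controller/reasoner loop until a concluding step or budget exhaustion."""
    state = State(question, [], budget)
    for _ in range(max_steps):
        action = controller(state)  # observes trace and remaining budget
        continuation = reasoner(state.question, state.trace, action.phrase)
        step = action.phrase + continuation  # the phrase initiates the step
        state.trace.append(step)
        state.remaining_budget -= count_tokens(step)
        if action.strategy == "conclude" or state.remaining_budget <= 0:
            break
    return state.trace


def toy_controller(state: State) -> SteeringAction:
    # Rule-based stand-in for the learned controller (assumption).
    if state.remaining_budget < 12 or len(state.trace) >= 3:
        return SteeringAction("conclude", "Therefore, the answer is")
    if not state.trace:
        return SteeringAction("decompose", "First, let me break this down:")
    return SteeringAction("verify", "Let me double-check:")


def toy_reasoner(question: str, trace: List[str], phrase: str) -> str:
    # Stand-in for a frozen LLM that continues from the steering phrase.
    return " [reasoning]"


if __name__ == "__main__":
    for line in steer("What is 17 * 24?", toy_controller, toy_reasoner, budget=100):
        print(line)
```

The point of the sketch is the division of labor: the controller only chooses what kind of step comes next and how it begins, while the reasoner writes the step itself, so the budget influences the strategy instead of simply truncating the trace.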