SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

πŸ“… 2025-05-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the resource waste and response latency caused by fixed decoding lengths in large language model (LLM) inference, this paper proposes an adaptive token budget control framework. Methodologically, it introduces: (1) a dynamic cost estimation mechanism grounded in query difficulty; (2) a two-stage training paradigm coupled with budget-guided GRPO, a reinforcement learning algorithm that enables user-specified token ceilings, real-time generation interruption, and predictable decoding latency; and (3) an integrated strategy combining controllable decoding with dynamic token scheduling. Evaluated on the MATH benchmark, the framework reduces response length by up to 74.47% while incurring less than a 0.3% accuracy drop, yielding substantial gains in inference efficiency and user experience while preserving task performance under tight budget constraints.

πŸ“ Abstract
Recently, large reasoning models have demonstrated exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter, a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GRPO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length via a pre-filled token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.
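To make the budget-guided reinforcement learning objective concrete, here is a minimal sketch of a reward that combines answer correctness with a penalty for exceeding the model's self-estimated token budget. The function name, the linear penalty form, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation:

```python
def budget_guided_reward(is_correct: bool, response_len: int, budget: int,
                         alpha: float = 1.0, beta: float = 0.001) -> float:
    """Illustrative budget-guided reward (hypothetical weights).

    Rewards a correct answer, and subtracts a penalty proportional to
    how far the response overran its pre-estimated token budget.
    Responses within budget incur no penalty.
    """
    correctness = alpha if is_correct else 0.0
    overrun = max(0, response_len - budget)  # tokens beyond the budget
    return correctness - beta * overrun
```

Under a GRPO-style update, rewards like this would be computed per sampled response and normalized within each group, so the model is pushed toward answers that are both correct and within their predicted budget.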
Problem

Research questions and friction points this paper is trying to address.

Inefficient token allocation in LLMs for varying query complexities
Resource waste and prolonged latency due to uniform processing
Balancing accuracy and output length in adaptive reasoning strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-phase training for cost estimation
Budget-guided GRPO for reinforcement learning
Pre-filling token budget control
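The pre-filling idea above can be sketched as follows: the budget is written into the start of the response so generation continues under it, and a hard decoding cap derived from the budget makes latency predictable. The tag format and the `answer_margin` parameter are assumptions for illustration, not the paper's exact template:

```python
def build_prefilled_prompt(question: str, token_budget: int) -> str:
    # Hypothetical template: pre-fill the token budget at the start of
    # the response so the model reasons within it.
    return (f"Question: {question}\n"
            f"Answer: <budget>{token_budget}</budget>\n")

def max_new_tokens_for(token_budget: int, answer_margin: int = 64) -> int:
    # A decoding cap derived from the budget lets the caller interrupt
    # generation deterministically; the margin reserved for the final
    # answer is an assumed parameter.
    return token_budget + answer_margin
```

In use, a caller would pass `build_prefilled_prompt(q, b)` to the model and set the decoder's maximum new tokens to `max_new_tokens_for(b)`, giving users an upper bound on generation time before decoding starts.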
Zheng Li
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Qingxiu Dong
Peking University
Natural Language Processing, Machine Learning
Jingyuan Ma
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Di Zhang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Zhifang Sui
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University