SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

πŸ“… 2025-05-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the resource waste and response latency caused by fixed decoding lengths in large language model (LLM) inference, this paper proposes an adaptive token budget control framework. Methodologically, it introduces: (1) a dynamic cost estimation mechanism grounded in query difficulty; (2) a two-stage training paradigm coupled with budget-guided GRPO, a reinforcement learning algorithm that enables user-specified token ceilings, real-time generation interruption, and predictable decoding latency; and (3) an integrated strategy combining controllable decoding with dynamic token scheduling. Evaluated on the MATH benchmark, the framework reduces response length by up to 74.47% while incurring less than a 0.3% accuracy drop, yielding substantial gains in inference efficiency and user experience while preserving task performance under tight budget constraints.

πŸ“ Abstract
Recently, large reasoning models have demonstrated exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter, a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GRPO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length via a pre-filled token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.
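To make the budget-guided reinforcement learning objective concrete, here is a minimal sketch of a reward that combines answer correctness with a penalty for exceeding the model's self-estimated token budget. The function name, the linear penalty form, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation:

```python
def budget_guided_reward(is_correct: bool, response_len: int, budget: int,
                         alpha: float = 1.0, beta: float = 0.001) -> float:
    """Illustrative budget-guided reward (hypothetical weights).

    Rewards a correct answer, and subtracts a penalty proportional to
    how far the response overran its pre-estimated token budget.
    Responses within budget incur no penalty.
    """
    correctness = alpha if is_correct else 0.0
    overrun = max(0, response_len - budget)  # tokens beyond the budget
    return correctness - beta * overrun
```

Under a GRPO-style update, rewards like this would be computed per sampled response and normalized within each group, so the model is pushed toward answers that are both correct and within their predicted budget.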
Problem

Research questions and friction points this paper is trying to address.

Inefficient token allocation in LLMs for varying query complexities
Resource waste and prolonged latency due to uniform processing
Balancing accuracy and output length in adaptive reasoning strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-phase training for cost estimation
Budget-guided GRPO for reinforcement learning
Pre-filling token budget control
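The pre-filling idea above can be sketched as follows: the budget is written into the start of the response so generation continues under it, and a hard decoding cap derived from the budget makes latency predictable. The tag format and the `answer_margin` parameter are assumptions for illustration, not the paper's exact template:

```python
def build_prefilled_prompt(question: str, token_budget: int) -> str:
    # Hypothetical template: pre-fill the token budget at the start of
    # the response so the model reasons within it.
    return (f"Question: {question}\n"
            f"Answer: <budget>{token_budget}</budget>\n")

def max_new_tokens_for(token_budget: int, answer_margin: int = 64) -> int:
    # A decoding cap derived from the budget lets the caller interrupt
    # generation deterministically; the margin reserved for the final
    # answer is an assumed parameter.
    return token_budget + answer_margin
```

In use, a caller would pass `build_prefilled_prompt(q, b)` to the model and set the decoder's maximum new tokens to `max_new_tokens_for(b)`, giving users an upper bound on generation time before decoding starts.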
Zheng Li
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Qingxiu Dong
Peking University
Natural Language Processing, Machine Learning
Jingyuan Ma
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Di Zhang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Zhifang Sui
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University