Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the pervasive imbalance between “overthinking” and “underthinking” in large language model (LLM) inference, this paper proposes a test-time dynamic budget allocation framework: Bayesian-modeling-guided subproblem decomposition coupled with uncertainty-driven adaptive token scheduling, enabling hierarchical, model-agnostic, and fine-tuning-free allocation of computational resources. The paper introduces the Bayesian Budget Allocation Model (BBAM), the first theoretical framework of its kind, along with the E³ metric for measuring test-time scaling efficiency. Extensive experiments across diverse tasks and models demonstrate up to a 70% accuracy improvement, a 39% reduction in token consumption, and a 187.5% gain in E³ efficiency. Notably, a 32B-parameter model achieves inference efficiency comparable to that of a 70B model, marking the first demonstration of Pareto-optimal trade-offs between accuracy and efficiency in LLM inference.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the $E^3$ metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in $E^3$. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.
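The abstract describes $E^3$ only as a metric that trades correctness off against computation. The exact formulation is defined in the paper; as a rough illustration, here is a hypothetical score that rewards accuracy while penalizing token spend, which captures the intended behavior (matching accuracy with fewer tokens scores higher):

```python
def e3_score(accuracy: float, tokens: int) -> float:
    """Illustrative effectiveness/efficiency score (NOT the paper's exact
    E^3 formula): rewards correctness quadratically and penalizes token
    consumption linearly, so equal-accuracy runs are ranked by brevity."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return accuracy ** 2 / tokens

# Two runs with equal accuracy: the more concise one ranks higher.
verbose = e3_score(accuracy=0.82, tokens=2400)
concise = e3_score(accuracy=0.82, tokens=1500)
print(concise > verbose)  # True
```

Any score with this shape makes the paper's headline numbers interpretable: cutting tokens at fixed accuracy, or raising accuracy at fixed tokens, both increase the metric.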
Problem

Research questions and friction points this paper is trying to address.

LLMs are inefficient due to overthinking in reasoning tasks
Fixed token budgets cause underthinking on complex problems
Unclear problem-solving strategies lead to computational inefficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Budget Allocation Model (BBAM) for reasoning
Plan-and-Budget framework for adaptive token allocation
E³ metric to balance correctness and efficiency
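The core idea of the Plan-and-Budget framework, per the abstract, is to decompose a query into sub-questions and allocate token budgets by estimated complexity. A minimal sketch of such a scheduler, assuming a proportional-to-uncertainty policy with a per-step floor (the paper's actual scheduling policies may differ):

```python
def allocate_budget(uncertainties: list[float],
                    total_budget: int,
                    floor: int = 32) -> list[int]:
    """Hypothetical uncertainty-driven scheduler: give every sub-question
    a minimum `floor` of tokens, then split the remaining budget in
    proportion to each sub-question's estimated uncertainty."""
    n = len(uncertainties)
    if total_budget < n * floor:
        raise ValueError("total budget too small for the per-step floor")
    spare = total_budget - n * floor
    weight_sum = sum(uncertainties) or 1.0  # avoid division by zero
    return [floor + int(spare * u / weight_sum) for u in uncertainties]

# Three sub-questions; the most uncertain one gets the largest share.
budgets = allocate_budget([0.1, 0.6, 0.3], total_budget=1000)
print(budgets)
```

Integer truncation can leave a few tokens unspent, which is harmless here; the point is that hard sub-questions get room to reason while easy ones are kept terse, countering both overthinking and underthinking at once.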
Junhong Lin
MIT CSAIL
Xinyue Zeng
Virginia Tech
Jie Zhu
Virginia Tech
Song Wang
University of Virginia
Julian Shun
MIT
Jun Wu
Michigan State University
Dawei Zhou
Virginia Tech