🤖 AI Summary
To address the high computational cost and latency of chain-of-thought large language models (e.g., O1), this paper proposes Length-Coordinated Fine-tuning (LCF), a framework for inference efficiency optimization under accuracy constraints. Methodologically, LCF introduces a novel dynamic token budget allocation mechanism that jointly leverages pre-sampling performance estimation and reinforcement-learning–inspired constrained optimization to enable difficulty-aware step-length control. It further incorporates an accuracy-preserving sequence length minimization objective and dynamic reasoning path pruning. Evaluated on multiple mathematical reasoning benchmarks, LCF reduces average inference latency by up to 42% while improving accuracy by +1.3–2.7 percentage points. Notably, it achieves adaptive inference length compression without compromising—indeed, while enhancing—model performance, establishing a new paradigm for efficient deployment of long-thinking LLMs.
📝 Abstract
Recently, long-thought reasoning LLMs, such as OpenAI's O1, adopt extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner