Not All Tokens Are What You Need In Thinking

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Redundant tokens in large language models’ chain-of-thought (CoT) reasoning cause high inference latency, computational waste, and overthinking. Method: Conditional Token Selection (CTS), a dynamic, fine-grained token-level compression framework with a flexible, adaptive compression ratio. CTS scores each CoT token’s conditional importance for deriving the correct answer, prunes low-importance tokens via a lightweight trainable token selector, and fine-tunes the model on the compressed CoT. Results: On GPQA, Qwen2.5-14B-Instruct trained with CTS gains 9.1% accuracy while using 13.2% fewer reasoning tokens (13% training token reduction); a further 42% training-token reduction cuts reasoning tokens by 75.8% at the cost of only a marginal 5% accuracy drop, indicating substantial redundancy in existing CoT.

📝 Abstract
Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking -- generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
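The core idea, selecting only the tokens most important for reaching the correct answer and training on the compressed CoT, can be illustrated with a minimal sketch. Note this is a hypothetical simplification: the paper's conditional importance scoring comes from a reference model, whereas here `compress_cot` takes precomputed per-token scores as input.

```python
# Hypothetical sketch of token-level CoT compression in the spirit of CTS.
# Assumption: per-token importance scores are given; in the paper they are
# computed by conditional importance scoring against the correct answer.

def compress_cot(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring fraction of CoT tokens, preserving order."""
    assert len(tokens) == len(scores)
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k most important tokens, restored to original order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:k])
    return [tokens[i] for i in top]

cot = ["First", ",", "note", "that", "2", "+", "2", "=", "4", "."]
importance = [0.1, 0.05, 0.2, 0.1, 0.9, 0.8, 0.9, 0.85, 0.95, 0.05]
print(compress_cot(cot, importance, keep_ratio=0.5))
# → ['2', '+', '2', '=', '4']
```

The compressed sequences would then serve as supervised fine-tuning targets, which is how the method transfers the token savings from training to inference.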
Problem

Research questions and friction points this paper is trying to address.

High inference latency in reasoning models
Excessive computational resource consumption
Redundant tokens in chains of thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Token Selection for compression
Flexible variable compression ratio framework
Importance scoring to preserve essential tokens
Hang Yuan
East China Normal University
Bin Yu
Harbin Institute of Technology
Haotian Li
Harbin Institute of Technology
Shijun Yang
University of Science and Technology of China
Christina Dan Wang
New York University Shanghai
Zhou Yu
East China Normal University
Xueyin Xu
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence
Weizhen Qi
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence
Kai Chen
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence