Not All Tokens Are What You Need In Thinking

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Redundant tokens in large language models’ chain-of-thought (CoT) reasoning cause high inference latency, computational waste, and overthinking. Method: Conditional Token Selection (CTS), a dynamic, fine-grained token-level compression framework with a flexible, adaptive compression ratio. CTS scores each CoT token’s conditional importance for deriving the correct answer, prunes low-importance tokens via a lightweight trainable token selector, and fine-tunes the model on the compressed CoT. Results: On GPQA, Qwen2.5-14B-Instruct trained with CTS gains 9.1% accuracy while using 13.2% fewer reasoning tokens (13% training token reduction); a further 42% training-token reduction cuts reasoning tokens by 75.8% at the cost of only a marginal 5% accuracy drop, indicating substantial redundancy in existing CoT.

📝 Abstract
Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking -- generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
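The core idea, selecting only the tokens most important for reaching the correct answer and training on the compressed CoT, can be illustrated with a minimal sketch. Note this is a hypothetical simplification: the paper's conditional importance scoring comes from a reference model, whereas here `compress_cot` takes precomputed per-token scores as input.

```python
# Hypothetical sketch of token-level CoT compression in the spirit of CTS.
# Assumption: per-token importance scores are given; in the paper they are
# computed by conditional importance scoring against the correct answer.

def compress_cot(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring fraction of CoT tokens, preserving order."""
    assert len(tokens) == len(scores)
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k most important tokens, restored to original order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:k])
    return [tokens[i] for i in top]

cot = ["First", ",", "note", "that", "2", "+", "2", "=", "4", "."]
importance = [0.1, 0.05, 0.2, 0.1, 0.9, 0.8, 0.9, 0.85, 0.95, 0.05]
print(compress_cot(cot, importance, keep_ratio=0.5))
# → ['2', '+', '2', '=', '4']
```

The compressed sequences would then serve as supervised fine-tuning targets, which is how the method transfers the token savings from training to inference.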
Problem

Research questions and friction points this paper is trying to address.

High inference latency in reasoning models
Excessive computational resource consumption
Redundant tokens in chains of thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Token Selection for compression
Flexible variable compression ratio framework
Importance scoring to preserve essential tokens
Hang Yuan
East China Normal University
Bin Yu
Harbin Institute of Technology
Haotian Li
Harbin Institute of Technology
Shijun Yang
University of Science and Technology of China
Christina Dan Wang
New York University Shanghai
Zhou Yu
East China Normal University
Xueyin Xu
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence
Weizhen Qi
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence
Kai Chen
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence