🤖 AI Summary
To address the high computational overhead induced by chain-of-thought (CoT) reasoning in large language models, this paper presents a systematic survey of state-of-the-art inference acceleration techniques. We propose a novel “Short–Small–Fast” tri-dimensional taxonomy: “Short” denotes reasoning chain compression (e.g., pruning redundant steps); “Small” refers to model lightweighting through structural simplification and training (e.g., pruning, knowledge distillation, reinforcement learning); and “Fast” encompasses decoding optimizations (e.g., early termination, speculative/parallel decoding). This framework is the first to unify diverse efficient inference paradigms under a coherent conceptual structure, shifting the field from empirical heuristics toward principled, systematic design. Based on a comprehensive review of over 100 works, we curate an open-source repository of efficient CoT inference papers, annotated with theoretical categorizations, standardized evaluation dimensions (e.g., latency, accuracy trade-offs, memory footprint), and reproducible benchmarks—thereby enabling rigorous design, evaluation, and deployment of efficient CoT models.
📝 Abstract
Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead, highlighting an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression methods, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference. A curated collection of papers discussed in this survey is available in our GitHub repository.
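To make the "faster" direction concrete, the sketch below simulates one of the decoding ideas mentioned above: early termination, where generation of the reasoning chain stops as soon as the model is sufficiently confident in its answer. This is a toy illustration, not code from the survey; the step list, confidence trace, and threshold are all hypothetical stand-ins for what a real system would read from the model's output distribution at each step.

```python
# Toy sketch of confidence-based early termination (a "faster" technique).
# In a real reasoning model, `confidences` would come from the model's
# probability of committing to its final answer after each reasoning step;
# here they are hypothetical values chosen for illustration.

def generate_with_early_exit(steps, confidences, threshold=0.9):
    """Emit reasoning steps until the simulated answer confidence
    reaches `threshold`, rather than always decoding the full chain."""
    emitted = []
    for step, conf in zip(steps, confidences):
        emitted.append(step)
        if conf >= threshold:
            break  # confident enough: stop reasoning and answer now
    return emitted

full_chain = ["step1", "step2", "step3", "step4", "step5"]
conf_trace = [0.30, 0.55, 0.92, 0.97, 0.99]  # hypothetical confidences

short_chain = generate_with_early_exit(full_chain, conf_trace)
print(short_chain)  # → ['step1', 'step2', 'step3']
```

Here the chain stops after three of five steps, saving two decoding rounds; real systems apply the same idea at token granularity, trading a small accuracy risk for lower latency.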