🤖 AI Summary
To address the high computational overhead induced by chain-of-thought (CoT) reasoning in large language models, this paper presents a systematic survey of state-of-the-art inference acceleration techniques. We propose a novel “Short–Small–Fast” tri-dimensional taxonomy: “Short” denotes reasoning chain compression (e.g., pruning redundant steps); “Small” refers to model lightweighting through structural simplification and training (e.g., pruning, knowledge distillation, reinforcement learning); and “Fast” encompasses decoding optimizations (e.g., early termination, speculative/parallel decoding). This framework is the first to unify diverse efficient inference paradigms under a coherent conceptual structure, shifting the field from empirical heuristics toward principled, systematic design. Based on a comprehensive review of over 100 works, we curate an open-source repository of efficient CoT inference papers, annotated with theoretical categorizations, standardized evaluation dimensions (e.g., latency, accuracy trade-offs, memory footprint), and reproducible benchmarks—thereby enabling rigorous design, evaluation, and deployment of efficient CoT models.
📝 Abstract
Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead, highlighting an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression methods, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference. A curated collection of papers discussed in this survey is available in our GitHub repository.
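To make the "faster" direction concrete, the sketch below simulates one of the decoding ideas mentioned above: early termination, where generation of the reasoning chain stops as soon as the model is sufficiently confident in its answer. This is a toy illustration, not code from the survey; the step list, confidence trace, and threshold are all hypothetical stand-ins for what a real system would read from the model's output distribution at each step.

```python
# Toy sketch of confidence-based early termination (a "faster" technique).
# In a real reasoning model, `confidences` would come from the model's
# probability of committing to its final answer after each reasoning step;
# here they are hypothetical values chosen for illustration.

def generate_with_early_exit(steps, confidences, threshold=0.9):
    """Emit reasoning steps until the simulated answer confidence
    reaches `threshold`, rather than always decoding the full chain."""
    emitted = []
    for step, conf in zip(steps, confidences):
        emitted.append(step)
        if conf >= threshold:
            break  # confident enough: stop reasoning and answer now
    return emitted

full_chain = ["step1", "step2", "step3", "step4", "step5"]
conf_trace = [0.30, 0.55, 0.92, 0.97, 0.99]  # hypothetical confidences

short_chain = generate_with_early_exit(full_chain, conf_trace)
print(short_chain)  # → ['step1', 'step2', 'step3']
```

Here the chain stops after three of five steps, saving two decoding rounds; real systems apply the same idea at token granularity, trading a small accuracy risk for lower latency.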