🤖 AI Summary
To address the high inference latency that autoregressive generation imposes on chain-of-thought (CoT) reasoning in large language models (LLMs), this paper proposes the "Silent Thought" (ST) paradigm: distilling explicit CoT into non-autoregressive, implicit reasoning states. Methodologically, the authors design a dual-pathway collaborative training framework and introduce a lightweight Reasoning Evolvement Module (REM). Through self-distillation and latent-state alignment, a small number of ST tokens evolve into rich, reasoning-laden implicit representations. Experiments demonstrate that ST achieves accuracy comparable to existing CoT baselines while significantly reducing inference latency and computational overhead, enabling deployment in latency-sensitive scenarios. The core contribution is an end-to-end self-distillation framework, DART, that transforms autoregressive CoT into non-autoregressive implicit reasoning, establishing a new paradigm for efficient LLM reasoning.
📝 Abstract
Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm incurs substantial computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose **DART** (**D**istilling **A**utoregressive **R**easoning to Silent **T**hought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with those of the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging the evolved ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART achieves comparable reasoning performance to existing baselines while offering significant efficiency gains, serving as a feasible alternative for efficient reasoning.
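The dual-pathway training described above can be sketched as a combined objective: the two pathways' answer losses plus a latent-state alignment term that pulls the REM-evolved ST hidden states toward the CoT pathway's hidden states. This is a minimal toy illustration, not the paper's implementation: the actual REM architecture, loss weighting, and hidden-state extraction are not specified in the abstract, so `rem`, `mse`, and `align_weight` here are hypothetical stand-ins.

```python
def mse(a, b):
    """Mean squared error between two equal-length hidden-state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def rem(st_hidden, weight=0.9, bias=0.1):
    """Hypothetical stand-in for the lightweight Reasoning Evolvement
    Module: a simple affine map that 'evolves' an ST token's hidden
    state before it is aligned with the CoT pathway."""
    return [weight * x + bias for x in st_hidden]

def dart_loss(cot_hidden, st_hidden, loss_cot, loss_st, align_weight=1.0):
    """Toy combined objective for the dual-pathway framework.

    cot_hidden / st_hidden: hidden states from the CoT and ST pathways.
    loss_cot / loss_st: the two pathways' answer losses (scalars here).
    The alignment term distills the CoT pathway's latent states into
    the REM-evolved ST states.
    """
    align = mse(rem(st_hidden), cot_hidden)
    return loss_cot + loss_st + align_weight * align

# Toy numbers standing in for real hidden states and losses.
cot_h = [0.5, -0.2, 0.3]
st_h = [0.4, -0.1, 0.2]
print(round(dart_loss(cot_h, st_h, loss_cot=1.2, loss_st=0.8), 4))
```

At inference time only the ST pathway (and REM) would run, so the CoT pathway's cost is paid during training alone, which is where the latency savings come from.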