HybridThinker: Efficient Chain-of-Thought Reasoning via Compressed Memory and Transient Thought Steps

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

career value

174K/year

🤖 AI Summary

This work addresses the high computational and memory costs of chain-of-thought (CoT) reasoning in large language models, which existing compression methods exacerbate by discarding fine-grained information and degrading accuracy. To overcome this trade-off, the paper proposes HybridThinker, a novel approach that integrates temporary retention of critical reasoning steps with memory token compression. It employs a hybrid training strategy wherein some intermediate reasoning steps remain visible to subsequent layers while others are masked, compelling the model to learn efficient compression and retrieval of essential information. This method achieves comparable performance to uncompressed baselines across four reasoning benchmarks while significantly reducing resource overhead. On average, HybridThinker outperforms current CoT compression techniques by 5.8 percentage points in accuracy, with similar inference latency.

📝 Abstract

Extended chain-of-thought (CoT) traces improve LLM reasoning but incur substantial computational and memory costs. While existing CoT compression methods mitigate this by condensing thought steps into compact representations via memory tokens and retaining only these representations at inference time, the loss of fine-grained information makes subsequent steps more error-prone. To alleviate this, we propose \textbf{HybridThinker}, where in addition to preserved these representations, thought steps are also temporarily retained to provide fine-grained details. However, we observe that naively keeping thought steps accessible to subsequent steps \emph{during training} lets the model bypass memory tokens by retrieving information directly from these steps, leaving the model's ability to compress and retrieve information through memory tokens insufficiently trained. We therefore introduce a hybrid training scheme, in which only some thought steps are directly accessible through attention to subsequent steps, while the other thought steps are masked, forcing the model to use memory tokens for compression and retrieval. Across 4 reasoning benchmarks, HybridThinker matches the uncompressed baseline, advancing the state of the art in CoT compression by 5.8 points on average accuracy with similar inference time. Ablation studies confirm that both temporary thought-step retention and the hybrid training scheme contribute to these gains.

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

CoT compression

memory tokens

reasoning

fine-grained information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Compression

Memory Tokens

Hybrid Training