ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

📅 2026-06-02

🤖 AI Summary
This work addresses the tendency of large reasoning models to fall into “overthinking” during outcome-based reinforcement learning due to redundant exploratory steps. To mitigate this inefficiency, the authors propose ThoughtFold, a novel framework that introduces, for the first time, a fine-grained introspective preference learning mechanism. This mechanism employs an introspective policy to identify and isolate redundant segments within otherwise correct reasoning trajectories, constructs a sub-trajectory spectrum, and formulates a masked preference optimization objective to explicitly suppress unproductive exploration. By directly linking critical reasoning steps, ThoughtFold transcends the conventional paradigm that relies solely on outcome-based rewards. Evaluated on DeepSeek-R1-Distill-Qwen-7B, the method reduces token consumption by approximately 56% while preserving state-of-the-art reasoning accuracy.
πŸ“ Abstract
Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, because long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which leads to overthinking in LRMs. Previous attempts to resolve this issue mainly assign a larger advantage to shorter trajectories, yet their learning signals remain outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
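The abstract does not spell out the objective, so the following is only an illustrative sketch of the two ideas it names: a spectrum of sub-trajectories obtained by dropping segments flagged as redundant, and a preference loss restricted by a token mask. Everything here is an assumption made for illustration: the function names `build_spectrum`, `masked_logratio`, and `masked_preference_loss`, the boolean redundancy flags, the `beta` parameter, and the use of a DPO-style logistic loss as a stand-in for the paper's masked preference optimization objective.

```python
import math
from typing import List, Sequence


def build_spectrum(segments: Sequence[str], redundant: Sequence[bool]) -> List[List[str]]:
    """Hypothetical spectrum: start from the full trajectory and drop one
    flagged segment at a time, ending with the fully folded trajectory."""
    keep = [True] * len(segments)
    candidates = [list(segments)]
    for i, flag in enumerate(redundant):
        if flag:
            keep[i] = False
            candidates.append([s for s, k in zip(segments, keep) if k])
    return candidates


def masked_logratio(policy_lp: Sequence[float], ref_lp: Sequence[float],
                    mask: Sequence[int]) -> float:
    """Sum of per-token log-ratios (policy minus reference) over tokens
    whose mask entry is 1; masked-out tokens contribute nothing."""
    return sum(m * (p - r) for p, r, m in zip(policy_lp, ref_lp, mask))


def masked_preference_loss(chosen_policy: Sequence[float], chosen_ref: Sequence[float],
                           chosen_mask: Sequence[int],
                           rejected_policy: Sequence[float], rejected_ref: Sequence[float],
                           rejected_mask: Sequence[int],
                           beta: float = 0.1) -> float:
    """DPO-style stand-in (an assumption, not the paper's exact objective):
    -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))),
    with both log-ratios computed only over unmasked tokens."""
    margin = beta * (
        masked_logratio(chosen_policy, chosen_ref, chosen_mask)
        - masked_logratio(rejected_policy, rejected_ref, rejected_mask)
    )
    # -log(sigmoid(x)) equals log(1 + exp(-x)); fine for small margins.
    return math.log1p(math.exp(-margin))


# Toy usage: segments "b" and "c" are flagged as redundant exploration.
spectrum = build_spectrum(["a", "b", "c", "d"], [False, True, True, False])
loss = masked_preference_loss(
    chosen_policy=[-1.0, -1.0], chosen_ref=[-2.0, -2.0], chosen_mask=[1, 0],
    rejected_policy=[-1.0], rejected_ref=[-1.0], rejected_mask=[1],
)
```

In this sketch the folded trajectory plays the role of the preferred sample and a trajectory that retains redundant segments plays the rejected one. The mask is assumed to select the tokens where the two differ, so that shared essential steps are not penalized.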
Problem

Research questions and friction points this paper is trying to address.

Large Reasoning Models
Chain-of-Thought
redundant exploration
over-thinking
reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspective Preference Learning
Chain-of-Thought Folding
Redundancy Reduction
Masked Preference Optimization
Efficient Reasoning