Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chain-of-thought (CoT) methods generate the entire reasoning trace in a single forward pass, making them prone to error accumulation and reasoning derailment, especially for small language models on long-horizon reasoning tasks. Method: We propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration: (i) at variable intervals based on token position, multiple candidate plans are generated and aggregated into a refined planning step; (ii) a lightweight LoRA module on top of the base LM implements the plan aggregation policy; and (iii) an online Step-DPO algorithm, with Twisted Sequential Monte Carlo supplying scalable stepwise supervision, detects and corrects planning errors at the step level. Contribution/Results: With only 10% of the supervised fine-tuning (SFT) data and 5% of the preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline on mathematical, scientific, and logical reasoning benchmarks, markedly improving the stability and accuracy of long-chain reasoning in small language models.
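The Twisted Sequential Monte Carlo component scores and resamples partial reasoning traces so that compute concentrates on promising partial plans. A rough illustration of one resampling step, not the paper's implementation: the name `tsmc_resample` is ours, and the twist function is a stand-in callable rather than a learned model.

```python
import random

def tsmc_resample(particles, twist, k=None, rng=None):
    """One illustrative TSMC-style resampling step.

    `particles` are partial reasoning traces; `twist` is any callable
    returning a nonnegative weight for an intermediate state. Traces
    are resampled in proportion to their twisted weights.
    """
    rng = rng or random
    k = len(particles) if k is None else k
    weights = [twist(p) for p in particles]
    # Multinomial resampling proportional to the twist weights.
    return rng.choices(particles, weights=weights, k=k)
```

With a twist that puts all mass on one trace, resampling returns only that trace, which is the degenerate case of the general weighted behavior.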

📝 Abstract
Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.
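The control flow described in the abstract, single-pass execution interleaved with plan exploration and aggregation at scheduled positions, can be sketched as a toy loop. Assumptions: the policies are string-valued stand-in callables rather than LMs, trace length is a crude proxy for token position, and all names are ours.

```python
def mppa_generate(primary_policy, aggregation_policy, prompt,
                  n_candidates=4, schedule=(10,), max_steps=8):
    """Illustrative control flow of Multi-Path Plan Aggregation.

    At checkpoints drawn from a variable interval schedule, several
    candidate plans are sampled from the primary policy and merged by
    the aggregation policy into one refined planning step; everywhere
    else the trace extends by ordinary single-pass execution.
    """
    trace = prompt
    checkpoints = iter(schedule)
    checkpoint = next(checkpoints, None)
    for _ in range(max_steps):
        if checkpoint is not None and len(trace) >= checkpoint:
            # Plan exploration: sample multiple candidate plans.
            candidates = [primary_policy(trace) for _ in range(n_candidates)]
            # Plan aggregation: merge candidates into a refined step.
            trace += aggregation_policy(trace, candidates)
            checkpoint = next(checkpoints, None)
        else:
            # Ordinary execution step.
            trace += primary_policy(trace)
    return trace
```

In the paper's minimal design the two policies share weights: the base LM plays `primary_policy` and a LoRA adapter on the same model plays `aggregation_policy`.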
Problem

Research questions and friction points this paper is trying to address.

Addresses chain-of-thought derailment in long reasoning tasks
Improves planning accuracy through multi-path plan aggregation
Enables efficient training for small language models on long trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Path Plan Aggregation for refined planning steps
Lightweight LoRA module enabling efficient plan aggregation
Online Step-DPO with TSMC providing stepwise supervision
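The Step-DPO objective applies the standard DPO loss to a chosen/rejected pair of planning steps rather than whole trajectories. A minimal sketch, assuming log-probabilities are already summed over each step's tokens; the function name and signature are ours.

```python
import math

def step_dpo_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss applied at step granularity.

    Inputs are step-level log-probabilities under the policy and a
    frozen reference model. The loss is small when the policy prefers
    the chosen step more strongly than the reference does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is ln 2, and it decreases monotonically as the policy's preference for the chosen step grows relative to the reference.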