Demystifying Long Chain-of-Thought Reasoning in LLMs

📅 2025-02-05
🤖 AI Summary
Long chain-of-thought (long CoT) reasoning in large language models (LLMs) suffers from poorly understood emergence conditions, unstable training dynamics, and weak out-of-distribution (OOD) generalization. Method: a three-part framework: (1) empirical identification of latent CoT self-correction capabilities already present in base models; (2) a noise-robust, scalable reward signal based on web-extracted solutions plus automated filtering, integrated into joint supervised fine-tuning (SFT) and reinforcement learning (RL); and (3) a fine-grained CoT-length metric and dynamic analysis framework characterizing the interplay among compute budget, reward shaping, and signal verifiability. Contribution/Results: On OOD STEM benchmarks, the approach significantly improves long CoT generation in length, stability, and generalization. The authors empirically validate a strong correlation between reward scalability and training efficiency, and provide a reproducible optimization pathway for long CoT alignment.

📝 Abstract
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
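Finding (2) above argues that reward shaping is needed to keep CoT length growth stable rather than unbounded. A minimal sketch of that idea, assuming a hypothetical reward that combines a correctness bonus with a soft length-budget penalty (the function name, budget, and 0.5 scale are illustrative, not the paper's exact formulation):

```python
def shaped_reward(correct: bool, cot_length: int,
                  target_length: int = 2048) -> float:
    """Toy shaped reward: correctness bonus minus a mild penalty for
    exceeding a CoT length budget, to damp runaway length growth."""
    base = 1.0 if correct else 0.0
    # Fraction by which the CoT overshoots the budget, capped at 1.0.
    overflow = max(0, cot_length - target_length) / target_length
    return base - 0.5 * min(overflow, 1.0)
```

Within budget, the reward is purely correctness-driven; far past the budget, a correct but bloated trace earns at most 0.5, so RL pressure toward ever-longer CoTs is bounded.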
Problem

Research questions and friction points this paper is trying to address.

Understanding long chain-of-thought reasoning in LLMs
Identifying key factors for long CoT generation
Optimizing training strategies for enhanced reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning enhances CoT reasoning
Supervised fine-tuning improves training efficiency
Noisy web solutions filtered for OOD tasks
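The last bullet corresponds to the paper's finding that noisy, web-extracted solutions can serve as verifiable reward signals once filtered. A minimal sketch of one plausible filtering scheme, assuming numeric final answers; the helper names and the "last number wins" heuristic are assumptions for illustration, not the paper's actual pipeline:

```python
import re

def normalize_answer(text: str):
    """Extract the last number from a noisy solution string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def verifiable_reward(model_answer: str, web_solution: str):
    """Binary reward against a filtered web label.

    Returns None when the web solution yields no parseable answer,
    signaling that the sample should be dropped rather than trained on.
    """
    label = normalize_answer(web_solution)
    if label is None:  # unverifiable label: filter out instead of adding noise
        return None
    return 1.0 if normalize_answer(model_answer) == label else 0.0
```

Dropping unparseable labels instead of defaulting them to a zero reward is the key design choice: it trades dataset size for reward precision, which matters most on OOD tasks where noisy labels would otherwise dominate.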