Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

📅 2026-06-11
🤖 AI Summary
How reinforcement learning (RL) improves reasoning capabilities during post-training remains poorly understood. This study addresses the gap through controlled mathematical reasoning experiments that disentangle two core mechanisms of post-training: strategy selection and strategy improvement. Using Qwen-2.5-1.5B with diverse supervised fine-tuning (SFT) data and progressively harder RL data, the authors show that supervising the model on diverse reasoning strategies can enable strategy selection, while increasing the difficulty of RL data can enable strategy improvement. The findings clarify the distinct roles of SFT and RL data in activating these mechanisms and suggest practical interventions for continuing to scale reasoning capabilities.
📝 Abstract
Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
reasoning models
post-training
mechanistic understanding
strategy selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

strategy selection
strategy improvement
reinforcement learning
reasoning models
post-training