Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an inefficiency in post-training small language models (SLMs) for reasoning: the supervised fine-tuning (SFT) and reinforcement learning (RL) stages typically draw on the same training data without considering which samples suit each stage. The authors propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. During SFT, the model focuses on not-yet-mastered skills, and a Bridge mechanism transforms raw teacher-generated reasoning traces for hard samples into supervision that SLMs can learn from more easily. During RL, hard samples that remain unsolved (all-zero-reward failures) are handled with Critique Fine-Tuning, which converts them into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. The approach delineates the complementary roles of SFT and RL in reasoning skill acquisition and combines stage-specific data allocation with cross-stage collaborative optimization. Experiments on two small models and five reasoning benchmarks show consistent improvements over representative SFT, distillation, and RL baselines, supporting the value of coordinating data difficulty across stages.

📝 Abstract

Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.
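
The stage-specific split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes difficulty is estimated from a per-sample pass rate over sampled rollouts, and the function names, the `hard_below` and `mastered_at` thresholds, and the separate "mastered" bucket are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StageSets:
    """Hypothetical container for stage-specific sample ids."""
    sft: list = field(default_factory=list)       # not-yet-mastered samples
    rl: list = field(default_factory=list)        # partially solvable samples
    mastered: list = field(default_factory=list)  # samples the model always solves


def split_by_pass_rate(pass_rates, hard_below=0.2, mastered_at=1.0):
    """Assign each sample to a stage from its estimated pass rate.

    pass_rates maps sample id -> fraction of sampled rollouts that were correct.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    sets = StageSets()
    for sample_id, rate in pass_rates.items():
        if rate < hard_below:
            sets.sft.append(sample_id)        # skill not yet accessible: learn via SFT
        elif rate >= mastered_at:
            sets.mastered.append(sample_id)   # already solved every time
        else:
            sets.rl.append(sample_id)         # partially accessible: consolidate via RL
    return sets


def all_zero_reward(rollout_rewards):
    """Return ids whose RL rollouts all received zero reward.

    In the abstract's framing, these failures would be converted into
    critique-style supervision for the next SFT stage.
    """
    return [
        sample_id
        for sample_id, rewards in rollout_rewards.items()
        if rewards and not any(r > 0 for r in rewards)
    ]
```

For example, `split_by_pass_rate({"a": 0.0, "b": 0.5, "c": 1.0})` places `a` in the SFT set, `b` in the RL set, and `c` in the mastered bucket under these assumed thresholds.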

Problem

Research questions and friction points this paper is trying to address.

Small Language Models

Supervised Fine-Tuning

Reinforcement Learning

Reasoning

Data Strategy

Innovation

Methods, ideas, or system contributions that make the work stand out.

stage-specific data

difficulty-aware training

Bridge mechanism