AI Summary
This work addresses the tendency of conventional supervised fine-tuning (SFT) to overfit to the surface patterns of a single expert trajectory, which suppresses the model's own reasoning distribution. To mitigate this, the authors propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that modulates the strength of expert supervision according to problem-level solvability estimated from verified on-policy rollouts. When the current policy struggles on a problem, expert guidance is strengthened; when the model already reasons reliably, rigid imitation is relaxed and its own correct reasoning paths are incorporated into training. To constrain policy drift while preserving useful reasoning priors, the method also applies a clipped inverse ratio between a frozen reference model and the current policy. Evaluated across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks, RASFT achieves better overall performance than standard SFT, SFT variants, and representative reinforcement learning baselines.
Abstract
Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.
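The three ingredients named in the abstract (solvability from verified rollouts, adaptive expert weighting, and a clipped inverse ratio) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the linear weighting rule, the threshold of 0.5, the clip width `eps`, and all function names below are hypothetical choices made purely for illustration.

```python
import math


def solvability(rollout_correct):
    """Fraction of verified on-policy rollouts that are correct (0.0 if none)."""
    if not rollout_correct:
        return 0.0
    return sum(rollout_correct) / len(rollout_correct)


def expert_weight(p_solve, w_min=0.1, w_max=1.0):
    """ASSUMED rule: linear interpolation, so harder problems
    (low solvability) receive stronger expert supervision."""
    return w_min + (w_max - w_min) * (1.0 - p_solve)


def clipped_inverse_ratio(logp_ref, logp_cur, eps=0.2):
    """pi_ref / pi_theta for one token or sequence, clipped to [1-eps, 1+eps].

    Clipping is done in log space to avoid overflow; requires 0 < eps < 1.
    The clip range is an assumption, not taken from the paper.
    """
    log_ratio = logp_ref - logp_cur
    log_ratio = min(max(log_ratio, math.log(1.0 - eps)), math.log(1.0 + eps))
    return math.exp(log_ratio)


def select_targets(expert, rollouts, rollout_correct, threshold=0.5):
    """Return (trajectory, weight) pairs to train on for one problem.

    The expert trajectory is always kept with an adaptive weight; correct
    self-generated rollouts are added only when the estimated solvability
    reaches the (ASSUMED) threshold.
    """
    p = solvability(rollout_correct)
    targets = [(expert, expert_weight(p))]
    if p >= threshold:
        targets += [(r, 1.0) for r, ok in zip(rollouts, rollout_correct) if ok]
    return targets
```

For a problem where three of four rollouts verify as correct, `select_targets` keeps the expert trajectory at a reduced weight and adds the three correct rollouts; for a problem the policy never solves, only the expert trajectory remains, at full weight. In a real training loop the weights and the clipped ratio would scale per-token log-likelihood terms; how RASFT combines them is specified in the paper and repository, not here.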