Learning to Adapt SFT Data for Better Reasoning Generalization

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of generalization that supervised fine-tuning (SFT) can suffer when external expert demonstrations are distributionally mismatched with the target model. The authors propose DART (Data Adaptation for Reasoning Tuning), which casts SFT data adaptation as a learnable demonstration-transformation problem. DART uses reinforcement learning to train a mapper that converts the original demonstrations into supervision better aligned with the target model's distribution; the transformed data are then used for fine-tuning, with the aim of avoiding the generalization loss that direct fine-tuning on mismatched data can cause. Experiments across multiple models and datasets show that DART improves reasoning generalization over standard SFT and trains more efficiently than direct reinforcement learning.

📝 Abstract

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.
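The two-stage flow described in the abstract (train a mapper with RL, then run SFT on the transformed data) can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the mapper is reduced to a softmax policy over two hypothetical rewrite strategies (`keep`, `simplify`), the target model is replaced by a smoothed unigram model, and the reward (mean token log-probability under that toy model) is a stand-in, since the abstract does not specify the reward DART uses. The final SFT step is omitted; the sketch stops at producing the adapted dataset.

```python
import math
from collections import Counter

# Hypothetical rewrite strategies standing in for the mapper's action space.
def keep(demo):
    return demo

def simplify(demo):
    # Toy rewrite: swap expert wording for words the toy target model prefers.
    return demo.replace("ergo", "so").replace("compute", "add")

STRATEGIES = [keep, simplify]

def make_target_logprob(corpus):
    """Toy stand-in for the target model: an add-one-smoothed unigram model."""
    counts = Counter(tok for text in corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def mean_logprob(text):
        toks = text.split()
        return sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks) / len(toks)

    return mean_logprob

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_mapper(demos, reward_fn, steps=200, lr=0.5):
    """Exact policy-gradient ascent on a softmax policy over strategies."""
    logits = [0.0] * len(STRATEGIES)
    # Average (assumed) reward of each strategy over the fixed SFT dataset.
    rewards = [sum(reward_fn(s(d)) for d in demos) / len(demos) for s in STRATEGIES]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d(expected reward)/d(logit_i) = p_i * (r_i - expected reward)
        logits = [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

def adapt_dataset(demos, probs):
    """Apply the most probable strategy to every demonstration."""
    best = STRATEGIES[max(range(len(probs)), key=probs.__getitem__)]
    return [best(d) for d in demos]

# Toy data: text the target model "likes" vs. mismatched expert demonstrations.
target_samples = ["so we add 2 and 3", "so the answer is 5"]
expert_demos = ["ergo we compute 2 and 3", "ergo the answer is 5"]

reward = make_target_logprob(target_samples)
policy = train_mapper(expert_demos, reward)
adapted = adapt_dataset(expert_demos, policy)  # would be passed to SFT next
print(policy, adapted)
```

In this toy setting the policy shifts toward the rewrite that scores higher under the stand-in target model, so the adapted demonstrations use the target model's preferred wording. A real mapper would be a language model generating free-form rewrites, and its reward would presumably also need to reflect whether the rewritten demonstration remains correct and actually helps the target model generalize.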

Problem

Research questions and friction points this paper is trying to address.

supervised fine-tuning

distribution mismatch

reasoning generalization

large language models

data adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Adaptation

Supervised Fine-Tuning

Reinforcement Learning