Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Supervised fine-tuning (SFT) of large language models suffers from high-variance importance sampling and training instability due to distributional mismatch between the behavior policy (e.g., outputs from a teacher or initial model) and the target policy in off-policy learning. To address this, we propose guided re-solving: a data rewriting framework that actively narrows the policy gap by regenerating erroneous responses so they better align with the target policy's distribution, aligning the training distribution before optimization begins. Our method integrates off-policy learning, importance sampling, KL regularization, and a dynamic rewriting mechanism. Evaluated on five mathematical reasoning benchmarks, guided re-solving significantly outperforms both standard SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. Results demonstrate substantial improvements in training stability, variance reduction, and generalization, validating both its effectiveness and conceptual novelty.

📝 Abstract
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem, where expert demonstrations come from a fixed behavior policy while training aims to optimize a target policy. Importance sampling is the standard tool for correcting this distribution mismatch, but large policy gaps lead to high variance and training instability. Existing approaches mitigate this issue using KL penalties or clipping, which passively constrain updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap by keeping correct solutions as on-policy data and rewriting incorrect ones with guided re-solving, falling back to expert demonstrations only when needed. This aligns the training distribution with the target policy before optimization, reducing importance sampling variance and stabilizing off-policy fine-tuning. Experiments on five mathematical reasoning benchmarks demonstrate consistent and significant gains over both vanilla SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. The data and code will be released at https://github.com/NKU-HLT/Off-Policy-SFT.
Problem

Research questions and friction points this paper is trying to address.

Corrects distribution mismatch in off-policy supervised fine-tuning
Reduces variance from large policy gaps during training
Stabilizes optimization by proactively rewriting incorrect solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data rewriting framework shrinks policy gap
Guided re-solving corrects incorrect solutions
Aligns training distribution before optimization
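The three-tier rewriting rule described in the abstract and the bullets above can be sketched as follows. This is a minimal illustration, not the authors' released implementation (see their repository for that); `sample_fn`, `verify_fn`, and `guided_resolve_fn` are hypothetical callables standing in for target-policy sampling, answer checking, and guided re-solving.

```python
def rewrite_dataset(examples, sample_fn, verify_fn, guided_resolve_fn):
    """Sketch of the data rewriting rule: keep correct target-policy
    samples as on-policy data, rewrite incorrect ones via guided
    re-solving, and fall back to the expert demonstration only when
    the re-solve also fails verification."""
    rewritten = []
    counts = {"on_policy": 0, "resolved": 0, "expert": 0}
    for problem, expert_answer in examples:
        # Tier 1: sample from the target policy; keep it if correct.
        candidate = sample_fn(problem)
        if verify_fn(problem, candidate):
            rewritten.append((problem, candidate))
            counts["on_policy"] += 1
            continue
        # Tier 2: guided re-solve conditioned on the expert solution.
        guided = guided_resolve_fn(problem, expert_answer)
        if verify_fn(problem, guided):
            rewritten.append((problem, guided))
            counts["resolved"] += 1
        else:
            # Tier 3: fall back to the original expert demonstration.
            rewritten.append((problem, expert_answer))
            counts["expert"] += 1
    return rewritten, counts
```

Because tiers 1 and 2 both produce text sampled from (or near) the target policy, the resulting training set sits closer to the target distribution, which is what shrinks the importance-sampling variance before optimization starts.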
Shiwan Zhao
Independent Researcher, Research Scientist of IBM Research - China (2000-2020)
AGI, Large Language Model, NLP, Speech, Recommender System
Xuyang Zhao
Peking University
Statistics, Machine Learning
Jiaming Zhou
College of Computer Science, Nankai University
Aobo Kong
Nankai University
NLP, LLM
Qicheng Li
College of Computer Science, Nankai University
Yong Qin
College of Computer Science, Nankai University