Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Supervised fine-tuning (SFT) of large language models suffers from high-variance importance sampling and training instability due to distributional mismatch between the behavior policy (e.g., outputs from a teacher or initial model) and the target policy in off-policy learning. To address this, we propose guided re-solving: a data rewriting framework that actively narrows the policy gap by regenerating erroneous responses so they better align with the target policy's distribution, aligning the training distribution before optimization begins. Our method integrates off-policy learning, importance sampling, KL regularization, and a dynamic rewriting mechanism. Evaluated on five mathematical reasoning benchmarks, guided re-solving significantly outperforms both standard SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. Results demonstrate substantial improvements in training stability, variance reduction, and generalization, validating both its effectiveness and conceptual novelty.

📝 Abstract
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem, where expert demonstrations come from a fixed behavior policy while training aims to optimize a target policy. Importance sampling is the standard tool for correcting this distribution mismatch, but large policy gaps lead to high variance and training instability. Existing approaches mitigate this issue using KL penalties or clipping, which passively constrain updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap by keeping correct solutions as on-policy data and rewriting incorrect ones with guided re-solving, falling back to expert demonstrations only when needed. This aligns the training distribution with the target policy before optimization, reducing importance sampling variance and stabilizing off-policy fine-tuning. Experiments on five mathematical reasoning benchmarks demonstrate consistent and significant gains over both vanilla SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. The data and code will be released at https://github.com/NKU-HLT/Off-Policy-SFT.
Problem

Research questions and friction points this paper is trying to address.

Corrects distribution mismatch in off-policy supervised fine-tuning
Reduces variance from large policy gaps during training
Stabilizes optimization by proactively rewriting incorrect solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data rewriting framework shrinks policy gap
Guided re-solving corrects incorrect solutions
Aligns training distribution before optimization
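The three-tier rewriting rule described in the abstract and the bullets above can be sketched as follows. This is a minimal illustration, not the authors' released implementation (see their repository for that); `sample_fn`, `verify_fn`, and `guided_resolve_fn` are hypothetical callables standing in for target-policy sampling, answer checking, and guided re-solving.

```python
def rewrite_dataset(examples, sample_fn, verify_fn, guided_resolve_fn):
    """Sketch of the data rewriting rule: keep correct target-policy
    samples as on-policy data, rewrite incorrect ones via guided
    re-solving, and fall back to the expert demonstration only when
    the re-solve also fails verification."""
    rewritten = []
    counts = {"on_policy": 0, "resolved": 0, "expert": 0}
    for problem, expert_answer in examples:
        # Tier 1: sample from the target policy; keep it if correct.
        candidate = sample_fn(problem)
        if verify_fn(problem, candidate):
            rewritten.append((problem, candidate))
            counts["on_policy"] += 1
            continue
        # Tier 2: guided re-solve conditioned on the expert solution.
        guided = guided_resolve_fn(problem, expert_answer)
        if verify_fn(problem, guided):
            rewritten.append((problem, guided))
            counts["resolved"] += 1
        else:
            # Tier 3: fall back to the original expert demonstration.
            rewritten.append((problem, expert_answer))
            counts["expert"] += 1
    return rewritten, counts
```

Because tiers 1 and 2 both produce text sampled from (or near) the target policy, the resulting training set sits closer to the target distribution, which is what shrinks the importance-sampling variance before optimization starts.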
Shiwan Zhao
Independent Researcher, Research Scientist of IBM Research - China (2000-2020)
AGI, Large Language Model, NLP, Speech, Recommender System
Xuyang Zhao
Peking University
Statistics, Machine Learning
Jiaming Zhou
College of Computer Science, Nankai University
Aobo Kong
Nankai University
NLP, LLM
Qicheng Li
College of Computer Science, Nankai University
Yong Qin
College of Computer Science, Nankai University