UFT: Unifying Supervised and Reinforcement Fine-Tuning

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised fine-tuning (SFT) often causes overfitting in large language models, while reinforcement fine-tuning (RFT) relies heavily on strong base models and suffers from high sample complexity, particularly on long-horizon reasoning tasks. Method: We propose Unified Fine-Tuning (UFT), a novel paradigm that integrates SFT and RFT into a single end-to-end optimization process, injecting informative supervised signals while the policy explores. Contribution/Results: Theoretically, UFT breaks RFT's exponential sample complexity barrier, yielding exponentially faster convergence on long-horizon reasoning. Its objective jointly optimizes the supervised loss and reinforcement targets while remaining fully compatible with standard RLHF components and SFT data formats. Empirically, UFT consistently outperforms standalone SFT and RFT across model scales, delivering significant gains in accuracy, robustness, and convergence speed on challenging long-horizon tasks such as mathematical reasoning and code generation, without requiring a strong pretrained base model.
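The joint objective described above can be sketched minimally as a weighted sum of a supervised cross-entropy term and a REINFORCE-style policy-gradient term over one token. This is an illustrative toy in numpy, not the paper's actual implementation; the mixing weight `lam` and the single-token setup are assumptions for clarity.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over a 1-D logit vector
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def unified_loss(logits, target_token, reward, lam=0.5, rng=None):
    """Toy sketch of a unified SFT+RFT objective for a single token.

    lam * cross_entropy(demonstration)  -- supervised signal (SFT)
    (1 - lam) * (-reward * log pi(a))   -- policy-gradient surrogate (RFT)

    `lam` is a hypothetical mixing weight; the paper's actual
    weighting/annealing scheme may differ.
    """
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    # SFT term: negative log-likelihood of the demonstrated token
    sft_loss = -np.log(probs[target_token])
    # RFT term: sample an action from the current policy, score by reward
    action = rng.choice(len(probs), p=probs)
    rft_loss = -reward * np.log(probs[action])
    return lam * sft_loss + (1 - lam) * rft_loss
```

With `lam=1.0` the objective reduces to plain SFT cross-entropy; with `lam=0.0` it reduces to a pure policy-gradient update, so a single knob interpolates between the two regimes.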

📝 Abstract
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorization and reasoning that underlies existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Unifying supervised and reinforcement fine-tuning for LLMs
Overcoming overfitting and generalization limitations in SFT and RFT
Exponentially accelerating convergence in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies SFT and RFT into one process
Combines exploration with supervised signals
Breaks RFT's exponential complexity bottleneck