mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

📅 2026-03-23
🤖 AI Summary
This work addresses a core challenge in multi-task supervised fine-tuning: a uniform compute budget often leads to overfitting on fast-learning tasks and underfitting on slow-learning ones. To mitigate this imbalance, the authors propose mSFT, an algorithm that introduces an overfitting-aware dynamic data mixing strategy. By iteratively identifying the earliest-overfitting subtask, rolling back to its optimal checkpoint, and adaptively adjusting the data mixture ratios for subsequent training, mSFT balances task-specific learning dynamics. The method requires no sensitive hyperparameters and consistently outperforms four strong baselines across ten benchmarks and six foundation models, achieving higher performance while reducing FLOPs, particularly under limited compute budgets.

📝 Abstract
Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to that sub-dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and that it is insensitive to its single new hyperparameter (the compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
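The iterative loop described above (train on an active mixture, detect the earliest-overfitting sub-dataset, roll back to its best checkpoint, and continue without it) can be sketched in Python. This is a hypothetical illustration working from precomputed per-task validation-loss curves; the function name `msft_schedule` and the simple "loss rises above its running best" overfitting criterion are assumptions for this sketch, not the authors' actual implementation, which trains a real model and manages real checkpoints.

```python
def msft_schedule(val_loss_curves):
    """Hypothetical sketch of the mSFT scheduling loop.

    val_loss_curves: dict mapping task name -> list of validation losses,
    one entry per training step on the active mixture.
    Returns (frozen, active): the order in which tasks were excluded, each
    paired with the step of its best (rollback) checkpoint, plus any tasks
    that never overfit within the budget.
    """
    active = set(val_loss_curves)
    frozen = []  # (task, best_checkpoint_step) in exclusion order
    max_steps = min(len(curve) for curve in val_loss_curves.values())
    step = 0
    while active and step + 1 < max_steps:
        step += 1
        # Assumed criterion: a task overfits once its validation loss
        # rises above its running best so far.
        overfit = [t for t in active
                   if val_loss_curves[t][step] > min(val_loss_curves[t][:step + 1])]
        if overfit:
            # Exclude the task that has degraded the most, and record the
            # step of its best checkpoint as the rollback point.
            worst = max(overfit,
                        key=lambda t: val_loss_curves[t][step]
                        - min(val_loss_curves[t][:step + 1]))
            best_step = min(range(step + 1),
                            key=lambda s: val_loss_curves[worst][s])
            frozen.append((worst, best_step))
            active.remove(worst)
            # Dropping the task implicitly renormalizes the mixture ratios
            # over the remaining sub-datasets.
    return frozen, sorted(active)


if __name__ == "__main__":
    # Toy curves: task A overfits after step 2; task B keeps improving.
    curves = {"A": [3.0, 2.0, 1.0, 1.5, 2.0],
              "B": [3.0, 2.5, 2.0, 1.5, 1.0]}
    print(msft_schedule(curves))  # A excluded, rolled back to step 2
```

In a real training run the rollback step would mean restoring model weights from the saved checkpoint with the lowest validation loss for the excluded task before training continues on the reduced mixture.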
Problem

Research questions and friction points this paper is trying to address.

multi-task SFT
dataset mixtures
overfitting
heterogeneous learning dynamics
compute budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task SFT
overfitting-aware
heterogeneous learning dynamics
iterative mixture optimization
compute-efficient training
👥 Authors

Woosung Koh (Trillion Labs, KAIST AI)
Jeyoung Jeon (Yonsei University)
Youngjin Song (Yonsei University)
Yujin Cheon
Soowon Oh (KAIST AI, Samsung Electronics)
Jaehyeong Choi (Yonsei University)
Se-Young Yun (KAIST AI)