🤖 AI Summary
This work addresses the challenge in multi-task supervised fine-tuning where a uniform compute budget often leads to overfitting on fast-learning tasks and underfitting on slow-learning ones. To mitigate this imbalance, the authors propose mSFT, an algorithm that introduces, for the first time, an overfitting-aware dynamic data mixing strategy. By iteratively identifying the earliest-overfitting subtask, rolling back to its optimal checkpoint, and adaptively adjusting the data mixture ratios for subsequent training, mSFT effectively balances task-specific learning dynamics. The method requires no sensitive hyperparameters and consistently outperforms four strong baselines across ten benchmarks and six foundation models, achieving higher performance while reducing FLOPs, particularly under limited compute budgets.
📝 Abstract
Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) with a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to that sub-dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
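The loop the abstract describes (train on the active mixture, detect the earliest-overfitting sub-dataset, exclude it, and roll back to its best checkpoint) can be sketched as a toy simulation. This is an illustrative assumption of the control flow only, not the paper's implementation: the quadratic validation-loss curves stand in for real per-task validation losses, and a checkpoint is reduced to a step index.

```python
def msft(tasks, budget):
    """Toy sketch of the mSFT search loop (assumed control flow, not the
    authors' code). tasks: dict mapping task name -> callable(step) giving
    that task's validation loss. Returns (final step, exclusion order)."""
    active = set(tasks)
    step = 0
    history = {name: [] for name in tasks}  # per-task (step, val_loss) trace
    excluded_order = []
    while active and step < budget:
        step += 1  # one unit of training on the active mixture
        for name in active:
            history[name].append((step, tasks[name](step)))
        # A task is overfitting once its val loss rises above its best so far.
        overfit = [n for n in active
                   if len(history[n]) >= 2
                   and history[n][-1][1] > min(l for _, l in history[n])]
        if overfit:
            # Exclude the task whose optimal checkpoint is earliest.
            victim = min(overfit,
                         key=lambda n: min(history[n], key=lambda p: p[1])[0])
            best_step = min(history[victim], key=lambda p: p[1])[0]
            active.remove(victim)
            excluded_order.append(victim)
            step = best_step  # roll back to that task's optimal checkpoint
            # Discard post-rollback trace entries for the remaining tasks.
            for n in active:
                history[n] = [(s, l) for s, l in history[n] if s <= step]
    return step, excluded_order
```

With two synthetic tasks whose losses bottom out at steps 2 and 6, the fast task is excluded first and training resumes from its best checkpoint, matching the rollback behavior the abstract describes.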