🤖 AI Summary
This work addresses the challenge in multi-task supervised fine-tuning where a uniform compute budget often leads to overfitting on fast-learning tasks and underfitting on slow-learning ones. To mitigate this imbalance, the authors propose mSFT, an algorithm that introduces, for the first time, an overfitting-aware dynamic data mixing strategy. By iteratively identifying the earliest-overfitting subtask, rolling back to its optimal checkpoint, and adaptively adjusting the data mixture ratios for subsequent training, mSFT effectively balances task-specific learning dynamics. The method requires no sensitive hyperparameters and consistently outperforms four strong baselines across ten benchmarks and six foundation models, achieving higher performance while reducing FLOPs, particularly under limited compute budgets.
📝 Abstract
Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) with a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to that sub-dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
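The loop the abstract describes (train on the active mixture, detect the earliest-overfitting sub-dataset, exclude it, and roll back to its best checkpoint) can be sketched as a toy simulation. This is an illustrative assumption of the control flow only, not the paper's implementation: the quadratic validation-loss curves stand in for real per-task validation losses, and a checkpoint is reduced to a step index.

```python
def msft(tasks, budget):
    """Toy sketch of the mSFT search loop (assumed control flow, not the
    authors' code). tasks: dict mapping task name -> callable(step) giving
    that task's validation loss. Returns (final step, exclusion order)."""
    active = set(tasks)
    step = 0
    history = {name: [] for name in tasks}  # per-task (step, val_loss) trace
    excluded_order = []
    while active and step < budget:
        step += 1  # one unit of training on the active mixture
        for name in active:
            history[name].append((step, tasks[name](step)))
        # A task is overfitting once its val loss rises above its best so far.
        overfit = [n for n in active
                   if len(history[n]) >= 2
                   and history[n][-1][1] > min(l for _, l in history[n])]
        if overfit:
            # Exclude the task whose optimal checkpoint is earliest.
            victim = min(overfit,
                         key=lambda n: min(history[n], key=lambda p: p[1])[0])
            best_step = min(history[victim], key=lambda p: p[1])[0]
            active.remove(victim)
            excluded_order.append(victim)
            step = best_step  # roll back to that task's optimal checkpoint
            # Discard post-rollback trace entries for the remaining tasks.
            for n in active:
                history[n] = [(s, l) for s, l in history[n] if s <= step]
    return step, excluded_order
```

With two synthetic tasks whose losses bottom out at steps 2 and 6, the fast task is excluded first and training resumes from its best checkpoint, matching the rollback behavior the abstract describes.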