DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections

📅 2025-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of dynamically optimizing dataset mixing ratios during instruction fine-tuning, this paper proposes DynamixSFT, a reinforcement learning–inspired approach grounded in the multi-armed bandit framework. It introduces Prior-scaled Boltzmann Exploration and a lightweight 1-Step Look-ahead Reward to adaptively assess each dataset's contribution and update the mixing policy online. Crucially, DynamixSFT preserves the diversity of the original data distribution while keeping optimization efficient and stable. Experiments on the Tulu-v2-mixture collection (16 datasets) demonstrate up to a 2.2% performance improvement across 10 benchmark tasks, and visualization analyses confirm the effectiveness and interpretability of the dynamic adaptation mechanism. The core contribution is the systematic integration of RL principles into instruction-tuning data mixture optimization, striking a balance between exploration efficiency and distribution fidelity.

📝 Abstract
As numerous instruction-tuning datasets continue to emerge during the post-training stage, dynamically balancing and optimizing their mixtures has become a critical challenge. To address this, we propose DynamixSFT, a dynamic and automated method for instruction-tuning dataset mixture optimization. We formulate the problem as a multi-armed bandit setup and introduce a Prior-scaled Boltzmann Exploration that softly anchors the updated sampling distribution to the original dataset proportions, thereby preserving the inherent diversity and coverage of the collection. Sampling probabilities are updated using a lightweight 1-Step Look-ahead Reward, reflecting how much the dataset contributes to improving the model's performance at its current state. When applied to the Tulu-v2-mixture collection comprising 16 instruction-tuning datasets, DynamixSFT achieves up to a 2.2% performance improvement across 10 benchmarks. Furthermore, we provide a comprehensive analysis and visualizations to offer deeper insights into the adaptive dynamics of our method.
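The abstract describes a bandit loop: sample a dataset from a Boltzmann distribution anchored to the original mixture proportions, then update that dataset's score with a 1-step look-ahead reward. The paper's exact formulas are not reproduced here; the sketch below is a minimal illustrative reading, assuming the prior anchoring multiplies the exponentiated scores by the original proportions, and simulating the look-ahead reward (which in the real method would be the measured improvement after one training step on a batch from the chosen dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_scaled_boltzmann(q_values, prior, temperature=1.0):
    """Boltzmann sampling distribution softly anchored to the original
    dataset proportions `prior`. One plausible form: scale the
    exponentiated scores by the prior, then renormalize."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    weights = np.asarray(prior, dtype=float) * np.exp(logits)
    return weights / weights.sum()

# Hypothetical setup: 4 datasets with original mixture proportions `prior`.
prior = np.array([0.4, 0.3, 0.2, 0.1])
q = np.zeros(4)        # running per-dataset reward estimates
alpha = 0.5            # smoothing rate for reward updates

for step in range(100):
    p = prior_scaled_boltzmann(q, prior, temperature=0.5)
    arm = rng.choice(len(p), p=p)               # pick a dataset to sample from
    # 1-step look-ahead reward: in the paper, the contribution of one
    # update on this dataset to current model performance; simulated
    # here with noise since no model is trained in this sketch.
    reward = rng.normal(loc=0.0, scale=0.1)
    q[arm] = (1 - alpha) * q[arm] + alpha * reward
```

Note that when all scores are equal (e.g. at initialization), the sampling distribution reduces exactly to the original proportions, which is how this form of anchoring preserves the collection's inherent coverage before any rewards are observed.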
Problem

Research questions and friction points this paper is trying to address.

Dynamically balancing instruction-tuning dataset mixtures
Optimizing sampling distribution to preserve dataset diversity
Improving model performance through adaptive dataset selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic automated instruction-tuning dataset mixture optimization
Multi-armed bandit setup with Prior-scaled Boltzmann Exploration
Lightweight 1-Step Look-ahead Reward for sampling updates