AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

📅 2025-06-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Designing effective multi-task data mixing strategies for large language model pretraining is hard because the relationship between data composition and emergent task capabilities is complex and difficult to model. To address this, the paper proposes Checkpoint-as-Mixer (CaM). CaM treats intermediate checkpoint models as “living data signal sources” that encode stage-wise capability evolution. It introduces an unsupervised, dynamic data reweighting and mixing framework: checkpoints are identified by their fine-grained capabilities on standardized benchmarks, and their aggregated gradient-level first-order influence approximation (FOIA) over source data determines the mixture. CaM requires no human annotations or task-specific priors, and it aligns the data distribution with the model’s evolving learning needs automatically. Evaluated on eight reasoning benchmarks in the pretraining setting, CaM achieves improvements of up to 1.93%, indicating that checkpoint models can improve pretraining data quality and data mixtures.

📝 Abstract

In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, because the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. The training process often saves checkpoints as artifacts, and these are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework yields significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
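
The abstract does not spell out how the aggregated first-order influence becomes a data mixture, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the influence of a training sample on a benchmark is the dot product of the sample's loss gradient and the benchmark's loss gradient at a given checkpoint (a first-order estimate of how much one SGD step on the sample would reduce benchmark loss), that influences are summed across checkpoints, and that per-source averages are turned into mixture weights with a softmax. The function names, the `lr` and `temperature` parameters, and the softmax normalization are all hypothetical choices made for illustration.

```python
import numpy as np


def first_order_influence(train_grads, bench_grad, lr=1.0):
    """Estimate each sample's influence on a benchmark at one checkpoint.

    Assumption: influence = lr * <grad of sample loss, grad of benchmark loss>.
    A positive value means a gradient step on the sample is expected to
    lower the benchmark loss to first order.

    train_grads: array of shape (n_samples, dim)
    bench_grad:  array of shape (dim,)
    returns:     array of shape (n_samples,)
    """
    return lr * (train_grads @ bench_grad)


def mixture_weights(influence_per_ckpt, source_ids, n_sources, temperature=1.0):
    """Aggregate influences over checkpoints and map them to source weights.

    influence_per_ckpt: array of shape (n_checkpoints, n_samples)
    source_ids:         non-negative int array of shape (n_samples,),
                        giving the data source of each sample
    returns:            array of shape (n_sources,) that sums to 1
    """
    # Sum the influence of each sample across all selected checkpoints.
    total = np.sum(influence_per_ckpt, axis=0)

    # Average the aggregated influence within each data source.
    sums = np.bincount(source_ids, weights=total, minlength=n_sources)
    counts = np.bincount(source_ids, minlength=n_sources)
    per_source = sums / np.maximum(counts, 1)

    # Softmax (hypothetical choice) turns scores into a valid mixture.
    z = per_source / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

In this sketch, a source whose samples have gradients aligned with the benchmark gradient receives a larger share of the mixture, and `temperature` controls how sharply the mixture concentrates on the highest-scoring sources.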

Problem

Research questions and friction points this paper is trying to address.

Optimizing data mixtures for multi-task language models

Leveraging checkpoint artifacts as data mixers

Improving pretraining performance via checkpoint influence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilize checkpoint models as data mixers

Aggregate first-order influence for data

Enhance pretraining with checkpoint artifacts

🔎 Similar Papers

Checkpoint Merging via Bayesian Optimization in LLM Pretraining

2024-03-28 · arXiv.org · Citations: 17
