🤖 AI Summary
This work addresses the challenge of optimizing multi-domain data mixing ratios in large language model training, a problem that existing methods solve inefficiently. The authors formulate it as a bi-level optimization problem and propose TANDEM, a framework that reduces it to a single-level optimization with a penalty term by using a twin-network architecture composed of a proxy model and a dynamically updated reference model. Domain-specific data are dynamically reweighted based on the discrepancy between the two models, prioritizing domains that yield higher performance gains. The approach comes with theoretical guarantees and also applies to data-constrained and supervised fine-tuning scenarios. Experiments show that TANDEM improves model performance across all of these settings.
📝 Abstract
The capabilities of large language models (LLMs) depend heavily on training data drawn from various domains. Optimizing domain-specific mixture ratios can be modeled as a bi-level optimization problem, which we simplify into a single-level penalized form and solve with twin networks: a proxy model trained on primary data and a dynamically updated reference model trained with additional data. Our proposed method, Twin Networks for bi-level DatA mixturE optiMization (TANDEM), measures data efficacy through the difference between the twin models and up-weights domains that benefit more from the additional data. Compared to prior approaches, TANDEM provides theoretical guarantees and wider applicability. Furthermore, our bi-level perspective suggests new settings in which to study domain reweighting, such as data-restricted scenarios and supervised fine-tuning, where optimized mixture ratios significantly improve performance. Extensive experiments validate TANDEM's effectiveness in all scenarios.
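To make the reweighting idea concrete, the following is a minimal sketch of one way domain weights could be updated from the gap between the twin models. It is not the paper's algorithm: the abstract does not state the update rule, so the exponentiated-gradient (multiplicative-weights) step, the function name `reweight_domains`, the `step_size` parameter, and the per-domain loss values are all assumptions made for illustration.

```python
import math


def reweight_domains(weights, proxy_losses, reference_losses, step_size=1.0):
    """Return new mixture weights that favor domains with a larger loss gap.

    Hypothetical helper: the multiplicative update below is an assumed
    stand-in, not the update rule from the TANDEM paper.
    """
    if not (len(weights) == len(proxy_losses) == len(reference_losses)):
        raise ValueError("all inputs must have one entry per domain")

    # Per-domain gap: how much lower the reference model's loss is than the
    # proxy model's loss, i.e. how much the domain gained from additional data.
    gaps = [p - r for p, r in zip(proxy_losses, reference_losses)]

    # Exponentiated-gradient step. Subtracting the largest gap keeps the
    # exponent non-positive, which avoids overflow without changing the result
    # after normalization.
    max_gap = max(gaps)
    scaled = [w * math.exp(step_size * (g - max_gap)) for w, g in zip(weights, gaps)]

    # Renormalize so the weights remain a valid mixture (sum to 1).
    total = sum(scaled)
    return [s / total for s in scaled]


# Illustrative, made-up per-domain losses for three domains.
weights = [1 / 3, 1 / 3, 1 / 3]
proxy_losses = [2.0, 3.0, 2.5]
reference_losses = [1.9, 2.4, 2.5]

new_weights = reweight_domains(weights, proxy_losses, reference_losses)
print(new_weights)
```

In this toy run the second domain has the largest gap (0.6), so it receives the largest share of the mixture, while the third domain, which gained nothing from the additional data, receives the smallest. In an actual training loop the losses would be re-estimated from both models as they are updated.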