Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a knowledge-distillation-free method for optimizing domain mixing in pretraining data to align the distribution of a base language model with that of a target model. Treating language models as points in log-likelihood space, the approach dynamically adjusts the mixing weights of data domains by minimizing the Kullback–Leibler (KL) divergence between the base and target models, thereby steering model updates toward the target distribution. Experiments based on the NanoGPT framework demonstrate that, compared to uniform sampling from the Pile dataset, the proposed method substantially reduces distributional discrepancy with the target model and yields downstream task performance markedly closer to that of the target. This study offers a novel, distribution-alignment-based perspective on data recipe design for language model pretraining.

📝 Abstract
Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.
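The abstract describes choosing domain weights so that the training update moves the base model toward the target in log-likelihood space. The paper's actual algorithm is not reproduced here; the following is a minimal illustrative sketch, assuming per-domain mean log-likelihoods for both models are available and that weights are set by a softmax over the base-to-target log-likelihood gaps (the function name, inputs, and softmax rule are assumptions for illustration, not the authors' method).

```python
import math

def domain_weights(base_ll, target_ll, temperature=1.0):
    """Hypothetical sketch: upweight domains where the base model's
    mean log-likelihood lags the target model's the most.

    base_ll, target_ll: per-domain mean log-likelihoods (same length).
    Returns mixture weights that sum to 1.
    """
    # Gap vector: the "direction toward the target" in log-likelihood space.
    gap = [t - b for b, t in zip(base_ll, target_ll)]
    # Softmax over gaps (shifted by the max for numerical stability).
    zmax = max(g / temperature for g in gap)
    expz = [math.exp(g / temperature - zmax) for g in gap]
    total = sum(expz)
    return [e / total for e in expz]

# Example with three hypothetical Pile domains: the base model lags
# the target most on the third domain, so it receives the most weight.
w = domain_weights([-3.2, -2.9, -3.8], [-3.0, -2.8, -3.1])
```

Under this sketch, uniform weighting corresponds to the limit of large `temperature`, matching the paper's uniform-sampling baseline over the Pile.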
Problem

Research questions and friction points this paper addresses.

domain mixture
model alignment
log-likelihood
language models
distribution alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

domain mixture design
log-likelihood differences
model alignment
KL divergence
continued pretraining
Ryo Kishino
Kyoto University
Riku Shiomi
Kyoto University
Hiroaki Yamagiwa
Assistant Professor, Kyoto University
natural language processing, embeddings
Momose Oyama
Kyoto University, RIKEN
Hidetoshi Shimodaira
Kyoto University
Statistics, Machine Learning