Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how knowledge acquisition mechanisms differ between Mixture-of-Experts (MoE) and dense models during pre-training. By introducing Gated-LPI, a neuron-level attribution method, and combining it with million-step training-trajectory analysis and attention-head masking experiments, the work reveals three distinctive properties of MoE: a low-entropy backbone structure, early knowledge consolidation, and functional robustness. The findings show that the top 1% of neurons in MoE contribute over 45% of positive parameter updates, with their importance stabilizing within 100,000 training steps. Moreover, masking critical attention heads results in less than a 10% performance drop, significantly smaller than in dense models, demonstrating that the sparse architecture enables more stable and distributed knowledge storage.
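The summary mentions attention-head masking as the robustness probe. The paper's exact setup is not given here; as a minimal illustrative sketch (the toy single-layer attention, weight names, and head count are all assumptions, not the paper's configuration), head masking amounts to zeroing a head's output and measuring how much the layer's output changes:

```python
import numpy as np

# Toy sketch of attention-head ablation (NOT the paper's setup): zero out
# selected heads in a single multi-head attention layer and compare the
# output against the unablated forward pass.
rng = np.random.default_rng(1)
seq, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attention(x, masked_heads=()):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)          # per-row softmax
        head_out = w @ v[:, sl]
        if h in masked_heads:
            head_out = np.zeros_like(head_out)      # ablate this head
        out[:, sl] = head_out
    return out

full = attention(x)
ablated = attention(x, masked_heads=(0,))
print("relative output change:",
      np.linalg.norm(full - ablated) / np.linalg.norm(full))
```

In the paper's experiment the analogous comparison is made on a downstream relational-retrieval metric (HIT@10) rather than raw layer outputs; a small relative change under ablation is what the summary describes as functional robustness.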

📝 Abstract
Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training, and how this process differs from dense architectures, remains unknown. To address this issue, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes log-probability increase across neurons. We present a time-resolved comparison of knowledge acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M training steps (~5.0T tokens) and 600K training steps (~2.5T tokens), respectively. Our experiments uncover three patterns: (1) Low-entropy backbone. The top ~1% of MoE neurons capture over 45% of positive updates, forming a high-utility core, which is absent in the dense baseline. (2) Early consolidation. The MoE model locks into a stable importance profile within <100K steps, whereas the dense model remains volatile throughout training. (3) Functional robustness. Masking the ten most important MoE attention heads reduces relational HIT@10 by <10%, compared with >50% for the dense model, showing that sparsity fosters distributed, rather than brittle, knowledge storage. These patterns collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
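The abstract defines Gated-LPI only at a high level: a metric that decomposes log-probability increase across neurons. Its exact formulation is not given on this page; as a minimal hedged sketch of the general idea of neuron-level log-probability attribution (the toy one-layer "model", zero-ablation scoring, and all names here are illustrative assumptions, not the paper's method):

```python
import numpy as np

# Toy sketch (NOT the paper's Gated-LPI): score each hidden neuron by how
# much the target token's log-probability drops when that neuron is
# zero-ablated in a tiny one-layer MLP "language model".
rng = np.random.default_rng(0)
d_in, d_hidden, vocab = 8, 16, 10
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, vocab))

def log_prob(x, target, mask=None):
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden activations
    if mask is not None:
        h = h * mask                          # zero-ablate selected neurons
    logits = h @ W2
    logits = logits - logits.max()            # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum())
    return logp[target]

x = rng.normal(size=d_in)
target = 3
base = log_prob(x, target)

# Per-neuron importance: drop in target log-prob under ablation.
scores = np.empty(d_hidden)
for j in range(d_hidden):
    mask = np.ones(d_hidden)
    mask[j] = 0.0
    scores[j] = base - log_prob(x, target, mask)

top = np.argsort(scores)[::-1][:3]
print("most important neurons:", top)
```

Ranking neurons by such a score and tracking the ranking across checkpoints is the kind of analysis behind the paper's "top ~1% of neurons capture over 45% of positive updates" and early-consolidation findings, though the actual Gated-LPI metric additionally incorporates the MoE gating structure.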
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
pre-training
knowledge acquisition
dense models
sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
knowledge attribution
Gated-LPI
sparsity
training dynamics
Bo Wang
The Hong Kong University of Science and Technology (Guangzhou)
Junzhuo Li
The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Hong Chen
The Hong Kong University of Science and Technology (Guangzhou)
Large Language Models · Multi-modal LLMs · Efficient LLMs
Yuanlin Chu
The Hong Kong University of Science and Technology (Guangzhou)
Yuxuan Fan
Peking University
Natural Language Processing
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST
Natural Language Processing · Large Language Model