GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing data mixing strategies in large language model pre-training: human-defined taxonomies suffer from ontological misalignment, and Euclidean clustering fails to handle embedding anisotropy, both of which impede the selection of optimal data ratios. The authors propose GEM (Geometric Entropy Mixing), a framework that formulates data curation as a variational problem on the hypersphere with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective with a Minorize-Maximize (MM) algorithm, the method counteracts cluster collapse and discovers balanced semantic structures that Euclidean heuristics miss. Teacher-student distillation scales this geometric fidelity to web-scale corpora, and the Geometric Influence Score (GIS) supports interpretable taxonomy generation. When integrated into mixing strategies such as DoReMi and RegMix, GEM improves average downstream accuracy by up to 1.2% in experiments with 1.1B-parameter models and offers a coordinate system for predictable data mixing.

📝 Abstract

LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework that reformulates data curation as a variational problem on the hypersphere, augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM counteracts cluster collapse and discovers balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state of the art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.
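
The abstract does not state GEM's objective or its MM updates, so the following is only a rough illustration of the general idea of clustering on the hypersphere while discouraging cluster collapse. It swaps in a simple soft spherical k-means with a heuristic size penalty, which is not the paper's algorithm. The function name `balanced_spherical_kmeans` and the parameters `kappa` (concentration) and `lam` (balance strength) are hypothetical and chosen purely for illustration.

```python
import numpy as np

def balanced_spherical_kmeans(X, k, kappa=10.0, lam=1.0, iters=50, seed=0):
    """Soft spherical clustering with a heuristic cluster-size penalty.

    Illustrative stand-in only, NOT the GEM objective or its MM algorithm:
    kappa, lam and the update rules below are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Project embeddings onto the unit hypersphere so only direction matters.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize mean directions from randomly chosen data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # Assignment step: cosine-similarity logits, with a term that
        # lowers the score of clusters that already hold a large share.
        logits = kappa * (X @ mu.T) - lam * np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)
        # Update step: cluster shares and re-normalized mean directions.
        pi = np.clip(R.mean(axis=0), 1e-8, None)
        pi /= pi.sum()
        M = R.T @ X
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        # Keep the previous direction if a cluster receives almost no mass.
        mu = np.where(norms > 1e-12, M / np.maximum(norms, 1e-12), mu)
    return R, mu, pi

if __name__ == "__main__":
    demo = np.random.default_rng(1).normal(size=(200, 8))
    R, mu, pi = balanced_spherical_kmeans(demo, k=4)
    print("cluster shares:", np.round(pi, 3))
```

In this sketch, `R` holds soft assignments of each embedding to a cluster, `mu` holds unit-norm cluster directions, and `pi` holds the resulting cluster shares. Such shares could in principle serve as the domains that a mixing strategy reweights, though how GEM actually feeds its clusters into DoReMi or RegMix is not described in the abstract.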

Problem

Research questions and friction points this paper is trying to address.

data curation

embedding anisotropy

ontological misalignment

data mixing

LLM pre-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric Entropy Mixing

hyperspherical optimization

embedding anisotropy