When Mean CE Fails: Median CE Can Better Track Language Model Quality

📅 2026-05-23

🤖 AI Summary

This work addresses cases where the standard validation metric, mean cross-entropy (Mean CE), fails to track task performance during language model training, specifically in supervised fine-tuning and knowledge distillation. By analyzing how the per-token cross-entropy distribution shifts during training, the study proposes median cross-entropy (Median CE) and a small set of percentile-based summaries as closer proxies for model quality. In Qwen2.5-1.5B SFT on synthetic fact learning, Mean CE rises after the initial learning phase while held-out fact-recall accuracy stays near its peak; in top-K distillation on TinyStories, the Top-5 student has the worst Mean CE but the highest LLM-judge score. In both settings, Median CE correlates more closely with task performance than Mean CE does. The authors recommend reporting percentile CE summaries alongside the mean and treating disagreement between Mean CE and Median CE as a low-cost diagnostic for model selection.
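As a rough illustration of what a percentile-based summary could look like, the sketch below computes per-token cross-entropy from raw logits and then reports the mean alongside a few percentiles. It is not the paper's code: the function names `per_token_ce` and `ce_summary`, the choice of percentiles, and the use of NumPy are all assumptions made for this example.

```python
import numpy as np

def per_token_ce(logits, targets):
    """Per-token cross-entropy (in nats) from raw logits.

    logits:  float array of shape (num_tokens, vocab_size)
    targets: int array of shape (num_tokens,)
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Numerically stable log-softmax: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability assigned to each target token.
    return -log_probs[np.arange(targets.shape[0]), targets]

def ce_summary(token_ce, percentiles=(25, 50, 75, 95, 99)):
    """Mean plus selected percentiles of a per-token CE array.

    The percentile set is an illustrative assumption, not the paper's choice.
    p50 is the median CE.
    """
    token_ce = np.asarray(token_ce, dtype=np.float64)
    summary = {"mean": float(token_ce.mean())}
    for p, value in zip(percentiles, np.percentile(token_ce, percentiles)):
        summary[f"p{p}"] = float(value)
    return summary
```

A single large per-token loss is enough to separate the two statistics: for the losses `[0.1, 0.2, 0.3, 0.4, 10.0]`, the mean is pulled up by the last token while `p50` stays at the typical value.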

📝 Abstract

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
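The disagreement between mean and median CE described in the abstract can be sketched as a simple model-selection check. The code below is a hedged illustration under stated assumptions: the helper names `select_checkpoint` and `mean_median_disagree` are hypothetical, and the two toy loss arrays are invented to mimic a distribution with more mass at both extremes; they are not data from the paper.

```python
import numpy as np

def select_checkpoint(ce_by_checkpoint, statistic):
    """Return the checkpoint name whose per-token CE minimizes `statistic`."""
    return min(ce_by_checkpoint, key=lambda name: statistic(ce_by_checkpoint[name]))

def mean_median_disagree(ce_by_checkpoint):
    """Check whether mean CE and median CE pick different checkpoints.

    Returns (disagree, best_by_mean, best_by_median).
    """
    best_by_mean = select_checkpoint(ce_by_checkpoint, np.mean)
    best_by_median = select_checkpoint(ce_by_checkpoint, np.median)
    return best_by_mean != best_by_median, best_by_mean, best_by_median

# Toy per-token losses (invented for illustration only).
# "flat" has every token at a moderate loss; "two_sided" has most tokens
# at a low loss plus one very high-loss tail token.
toy = {
    "flat": np.array([1.0, 1.0, 1.0, 1.0, 1.0]),
    "two_sided": np.array([0.2, 0.2, 0.2, 0.2, 6.0]),
}

disagree, by_mean, by_median = mean_median_disagree(toy)
print(disagree, by_mean, by_median)
```

In this toy case the tail token raises the mean of `two_sided` above that of `flat` while its median is lower, so the two statistics select different checkpoints. A flag like this does not say which checkpoint is better; it only signals that the per-token CE distribution has changed shape and that a task-level evaluation is worth running.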

Problem

Research questions and friction points this paper is trying to address.

Mean Cross-Entropy

Language Model Evaluation

Validation Metric

Median Cross-Entropy

Model Quality Tracking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Median Cross-Entropy

Language Model Evaluation

Cross-Entropy Distribution