🤖 AI Summary
Existing analyses fail to disentangle how statistical regularities (fluency) and factual knowledge interact within pretrained language models to affect generalization. Method: We propose a controllable synthetic dual-stream framework—comprising statistically fluent and factually grounded streams—to isolate these factors under controlled conditions, systematically varying contextual structure and diversity. Through targeted interventions on model components and multi-stage analysis, we examine their distinct roles in in-distribution convergence and out-of-distribution factual reasoning. Contribution/Results: We uncover a nonlinear relationship between diversity and generalization: high diversity delays in-distribution convergence but substantially enhances out-of-distribution factual inference, enabling nontrivial factual recall under specific structural configurations. Crucially, we identify the embedding layer as the key bottleneck for joint statistical-factual generalization and pinpoint distinct architectural factors responsible for fluency failures versus factual-recall failures. This work establishes a novel paradigm for investigating knowledge representation and generalization mechanisms in language models.
📝 Abstract
Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these effects. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design permits independent control of the nature of diversity, by manipulating stream composition (contextual structure), and of the level of diversity, by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID; in others, diversity becomes essential for non-trivial factual recall. Even in settings where low diversity precludes factual recall, the optimal diversity level depends on training duration. Beyond factual-recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade, showing how the interplay between contextual structure and diversity level shapes different aspects of generalization. Further, through a series of controlled interventions on model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.
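To make the dual-stream construction concrete, here is a minimal sketch of how such a testbed might generate training sequences. All names, vocabulary sizes, and the interleaving scheme are illustrative assumptions, not the paper's exact construction: each fact is a (source, target) token pair, each "statistical stream" is a pool of generic filler tokens, and the `diversity` parameter controls how many distinct streams each fact is embedded in.

```python
import random

def make_vocab(prefix, n):
    # Generic token identifiers for one stream (hypothetical naming scheme).
    return [f"{prefix}{i}" for i in range(n)]

def build_dataset(num_facts=50, num_contexts=10, diversity=3,
                  context_len=8, seed=0):
    """Sketch of a dual-stream generator: factual source-target pairs
    embedded in `diversity` of the `num_contexts` statistical streams."""
    rng = random.Random(seed)
    sources = make_vocab("s", num_facts)   # factual stream: source tokens
    targets = make_vocab("t", num_facts)   # factual stream: target tokens
    # Each statistical stream is a pool of generic filler tokens.
    contexts = [make_vocab(f"c{k}_", 20) for k in range(num_contexts)]

    sequences = []
    for src, tgt in zip(sources, targets):
        # Diversity level: which statistical streams this fact appears in.
        allowed = rng.sample(range(num_contexts), diversity)
        for k in allowed:
            filler = rng.choices(contexts[k], k=context_len)
            # Contextual structure: here, the factual pair is inserted
            # contiguously at a random position inside the filler run.
            cut = rng.randint(0, context_len)
            sequences.append(filler[:cut] + [src, tgt] + filler[cut:])
    return sequences

seqs = build_dataset()
```

Under this toy scheme, raising `diversity` multiplies the number of distinct contexts per fact without changing the fact set itself, which is the kind of independent manipulation the abstract describes; OOD evaluation would then present facts inside held-out streams.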