🤖 AI Summary
This study investigates whether Martin’s Law—the empirical observation that higher-frequency words exhibit greater polysemy—emerges during neural language model training, and how this relationship evolves with model scale. Method: We quantify word sense counts via DBSCAN clustering of contextualized word embeddings, systematically tracking semantic structure dynamics across 30 training checkpoints of the Pythia family (70M–1B parameters). Contribution/Results: We find a non-monotonic emergence of Martin’s Law, characterized by an “optimal semantic window” where the frequency–polysemy correlation peaks mid-training. Larger models exhibit earlier onset of semantic degradation yet maintain a more robust frequency–specificity trade-off. Crucially, we introduce the first evaluation paradigm for language structure emergence grounded in the dynamic trajectory of Martin’s Law, providing a quantifiable, scale-invariant framework for analyzing the development of semantic competence in large language models.
📝 Abstract
We present the first systematic investigation of Martin's Law, the empirical relationship between word frequency and polysemy, in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M–1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) around step 10^4, then degrades by step 10^5. Smaller models (70M, 160M) undergo catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) degrade gracefully. The frequency–specificity trade-off remains stable (r ≈ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text does not increase monotonically with training, but instead follows a non-monotonic trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
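The core measurement described above can be illustrated with a minimal sketch: cluster a word's contextualized embeddings with DBSCAN and treat the number of clusters as its sense count. The `eps` and `min_samples` values, the toy embeddings, and the `sense_count` helper below are all illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def sense_count(embeddings, eps=0.5, min_samples=5):
    """Estimate a word's sense count as the number of DBSCAN clusters
    found among its contextualized embeddings (noise label -1 excluded).
    eps/min_samples are assumed values, not the paper's settings."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    return len(set(labels) - {-1})

# Toy stand-ins for contextualized embeddings: a "polysemous" word whose
# contexts form two well-separated clusters, and a "monosemous" word
# whose contexts form a single tight cluster.
rng = np.random.default_rng(0)
poly = np.vstack([
    rng.normal(0.0, 0.05, size=(50, 8)),  # sense 1 contexts
    rng.normal(3.0, 0.05, size=(50, 8)),  # sense 2 contexts
])
mono = rng.normal(0.0, 0.05, size=(100, 8))

print(sense_count(poly))  # two clusters -> 2
print(sense_count(mono))  # one cluster -> 1
```

In the study's setting, repeating this per word and correlating sense counts with word frequency across the vocabulary at each checkpoint would yield the frequency–polysemy correlation tracked over training.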