Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models

📅 2024-05-22
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Detecting AI-generated text and tracing training data remain difficult because theoretically grounded statistical criteria are lacking. Method: This paper proves a new asymptotic equipartition property (AEP) for the log-perplexity of long sequences generated by a language model, without assuming stationarity or other strong regularity conditions. The log-perplexity converges almost surely to the average entropy of the token distributions, and this convergence defines a typical set. Contribution/Results: All long synthetic texts must lie in this typical set, which is an exponentially small subset of all possible outputs. Experiments on open-source LLMs (e.g., Llama, Phi) support the stable convergence of perplexity and the exponential decay of the typical-set measure with sequence length. The results suggest a theoretically grounded statistical criterion for AI-text detection and training-data auditing.
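
The convergence claim in this summary can be written as a single formula. The notation below is an assumption made for illustration and is not taken from the paper: x_1, ..., x_N is a text sampled from the model, p_n is the model's next-token distribution at step n given the preceding tokens, and H(p_n) is its Shannon entropy.

```latex
% Assumed notation (not from the paper):
%   p_n(v) = P(x_n = v \mid x_1, \dots, x_{n-1}),  H(p_n) = -\sum_v p_n(v) \log p_n(v)
\underbrace{\frac{1}{N} \sum_{n=1}^{N} \bigl[ -\log p_n(x_n) \bigr]}_{\text{log-perplexity}}
\;-\;
\underbrace{\frac{1}{N} \sum_{n=1}^{N} H(p_n)}_{\text{average entropy}}
\;\longrightarrow\; 0
\quad \text{as } N \to \infty .
```

Read this way, the typical set is the set of texts whose log-perplexity stays within a small tolerance of the average entropy of the distributions that produced them.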

📝 Abstract
We prove a new asymptotic equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We show that this typical set is a vanishingly small subset of all possible grammatically correct outputs. These results suggest possible applications to important practical problems such as (a) detecting synthetic AI-generated text, and (b) testing whether a text was used to train a language model. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations.
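
The convergence described in the abstract can be illustrated with a small simulation. The sketch below does not use a real language model: `toy_next_token_probs` is a hypothetical stand-in whose output depends on both the previous token and the position, so the process is deliberately non-stationary. The function names, vocabulary size, and logit construction are all assumptions made for illustration. The code samples a sequence, then compares the log-perplexity with the average entropy of the next-token distributions along that same sequence.

```python
import numpy as np


def toy_next_token_probs(prev_token, position, vocab_size, tables):
    """Hypothetical stand-in for a language model's next-token distribution.

    The logits depend on the previous token and drift with the position,
    so the sampled process is not stationary.
    """
    logits = tables[prev_token] + np.sin(0.01 * position + np.arange(vocab_size))
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def sample_and_score(num_tokens, vocab_size=50, seed=0):
    """Sample a sequence; return (log-perplexity, average entropy) in nats."""
    rng = np.random.default_rng(seed)
    # Random per-token logit table standing in for model weights (assumption).
    tables = rng.normal(scale=2.0, size=(vocab_size, vocab_size))
    prev_token = 0
    neg_log_prob_sum = 0.0
    entropy_sum = 0.0
    for position in range(num_tokens):
        probs = toy_next_token_probs(prev_token, position, vocab_size, tables)
        token = rng.choice(vocab_size, p=probs)
        neg_log_prob_sum += -np.log(probs[token])        # surprisal of the sampled token
        entropy_sum += -np.sum(probs * np.log(probs))    # entropy of the distribution used
        prev_token = token
    return neg_log_prob_sum / num_tokens, entropy_sum / num_tokens


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        log_ppl, avg_entropy = sample_and_score(n)
        print(n, round(log_ppl, 4), round(avg_entropy, 4), round(abs(log_ppl - avg_entropy), 4))
```

In this toy setting the gap between the two printed quantities should tend to shrink as the sequence grows, which is the behavior the abstract describes. A detection test built on this idea would score a given text under a model and check whether its log-perplexity lies close to the average entropy. The threshold for "close" is a design choice this sketch does not address.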
Problem

Research questions and friction points this paper is trying to address.

Text Discrimination
Language Model
AI Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perplexity Pattern
Text Authentication
Language Model Training