Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models

📅 2024-05-22

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Detecting AI-generated text and tracing training data remain challenging because theoretically grounded statistical criteria are lacking. Method: This paper establishes, for the first time without assuming stationarity or other strong regularity conditions, an asymptotic equipartition property (AEP) for the log-perplexity of long sequences generated by language models. It proves that the log-perplexity converges almost surely to the average entropy of the token distributions, and it defines the minimal typical set. Contribution/Results: The theory shows that synthetic texts necessarily lie in an exponentially sparse typical set, which gives them an intrinsic statistical rigidity. The analysis combines information-theoretic arguments, the law of large numbers, and the AEP derivation. Experiments on open-source LLMs (e.g., Llama, Phi) confirm stable convergence of perplexity and exponential decay of the typical-set measure with sequence length. The work provides the first verifiable, theoretically guaranteed statistical criterion for AI-text detection and training-data auditing.

📝 Abstract

We prove a new asymptotic equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We show that this typical set is a vanishingly small subset of all possible grammatically correct outputs. These results suggest possible applications to important practical problems such as (a) detecting synthetic AI-generated text, and (b) testing whether a text was used to train a language model. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations.
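
The convergence claim in the abstract can be illustrated numerically with a toy sketch. It is not the authors' experiment. The function `toy_next_token_probs` is a hypothetical stand-in for a language model: it returns a next-token distribution that depends on both the position and the previous token, so the process is deliberately non-stationary. The code samples a sequence from this toy model and compares the log-perplexity (the average negative log-probability of the sampled tokens) with the average entropy of the distributions the tokens were drawn from. Natural logarithms are an assumed convention here.

```python
import math
import random


def toy_next_token_probs(context, vocab_size=4):
    """Hypothetical stand-in for a language model (not from the paper).

    Returns a next-token distribution that depends on the position and on
    the previous token, so the generated process is not stationary.
    """
    t = len(context)
    last = context[-1] if context else 0
    # The drift factor stays in [0.5, 1.5], so every weight is at least 1.
    drift = 1.0 + 0.5 * math.sin(0.01 * t)
    weights = [1.0 + ((k + last) % vocab_size) * drift for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_and_score(length, seed=0):
    """Sample `length` tokens; return (log-perplexity, average entropy) in nats."""
    rng = random.Random(seed)
    context = []
    neg_log_lik = 0.0
    entropy_sum = 0.0
    for _ in range(length):
        probs = toy_next_token_probs(context)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        # Surprisal of the token that was actually sampled.
        neg_log_lik -= math.log(probs[token])
        # Entropy of the distribution the token was sampled from.
        entropy_sum -= sum(p * math.log(p) for p in probs)
        context.append(token)
    return neg_log_lik / length, entropy_sum / length


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        log_ppl, avg_ent = sample_and_score(n)
        print(f"N={n:>6}  log-perplexity={log_ppl:.4f}  "
              f"avg entropy={avg_ent:.4f}  gap={abs(log_ppl - avg_ent):.4f}")
```

In this toy setting the gap between the two quantities should generally shrink as the sequence length grows, which is the behavior the abstract describes. A text that was not sampled from the model has no such guarantee, and that is the intuition behind the detection applications the abstract mentions.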

Problem

Research questions and friction points this paper is trying to address.

Text Discrimination

Language Model

AI Detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Perplexity Pattern

Text Authentication

Language Model Training
