🤖 AI Summary
Detecting AI-generated text and tracing training data remain challenging because theoretically grounded statistical criteria are lacking. Method: This paper establishes, for the first time without assuming stationarity or other strong regularity conditions, the asymptotic equipartition property (AEP) for the log-perplexity of long sequences generated by language models. It proves that the log-perplexity converges almost surely to the average entropy of the token distributions, which defines the minimal typical set. Contribution/Results: The theory shows that synthetic texts necessarily reside in an exponentially sparse typical set and therefore exhibit an intrinsic statistical rigidity. The analysis combines information-theoretic arguments, the law of large numbers, and the AEP derivation. Experiments on open-source LLMs (e.g., Llama, Phi) show stable convergence of perplexity and exponential decay of the typical-set measure with sequence length. These results suggest a theoretically grounded statistical criterion for AI-text detection and training-data auditing.
📝 Abstract
We prove a new asymptotic equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We show that this typical set is a vanishingly small subset of all possible grammatically correct outputs. These results suggest possible applications to important practical problems such as (a) detecting synthetic AI-generated text, and (b) testing whether a text was used to train a language model. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations.
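The convergence claim can be illustrated with a small simulation. The sketch below is not the paper's experiment: `toy_next_token_dist` is a hypothetical stand-in for a language model, chosen only because its next-token distribution depends on both the previous token and the position, so it is not stationary. The code samples a sequence from this toy model and compares its log-perplexity (the average negative log-probability of the sampled tokens) with the average entropy of the next-token distributions met along the way, which is our reading of "the average entropy of its token distributions".

```python
import math
import random


def toy_next_token_dist(context, vocab_size=4):
    """Hypothetical stand-in for a language model (an assumption, not from the paper).

    Returns a next-token distribution that depends on the previous token
    and on the position, so the process is not stationary.
    """
    last = context[-1] if context else 0
    t = len(context)
    # Every weight is at least 1, so every token has nonzero probability.
    weights = [1.0 + ((last + k) % vocab_size) + ((k * t) % 3)
               for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_and_score(n_tokens, seed=0):
    """Sample n_tokens from the toy model.

    Returns (log_perplexity, average_entropy), both in nats per token.
    """
    rng = random.Random(seed)
    context = []
    neg_log_prob = 0.0
    entropy_sum = 0.0
    for _ in range(n_tokens):
        p = toy_next_token_dist(context)
        # Entropy of the distribution the model used at this step.
        entropy_sum += -sum(q * math.log(q) for q in p)
        # Draw the next token from that same distribution.
        tok = rng.choices(range(len(p)), weights=p)[0]
        neg_log_prob += -math.log(p[tok])
        context.append(tok)
    return neg_log_prob / n_tokens, entropy_sum / n_tokens


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        log_ppl, avg_entropy = sample_and_score(n, seed=1)
        print(f"n={n:6d}  log-perplexity={log_ppl:.4f}  "
              f"avg entropy={avg_entropy:.4f}  "
              f"gap={abs(log_ppl - avg_entropy):.4f}")
```

In this toy setting the gap between the two quantities should tend to shrink as the sequence grows. A detector built on this idea would, under the same assumptions, flag a text whose log-perplexity under a given model sits far from that model's average entropy on the same text.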