A path to natural language through tokenisation and transformers

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study investigates how the depth of Byte Pair Encoding (BPE) tokenization reshapes fundamental statistical properties of natural language—such as Zipf’s law and Heaps’ law—and influences the learning capacity of Transformer models. By theoretically deriving the Zipfian distribution of token frequencies under BPE and the expected slot entropy, and integrating Shannon entropy analysis with Transformer training dynamics, the work establishes the first quantitative link between BPE depth and linguistic statistical regularities. Experimental results demonstrate that recursive application of BPE yields token frequencies that more closely adhere to Zipf’s power law, improves alignment between model-predicted entropy and theoretical expectations, and attenuates local dependencies, driving sequences toward a weakly correlated state. These findings reveal that BPE functions not merely as a compression mechanism but as a statistical transformation that actively restructures the informational architecture of language.
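The recursive BPE procedure the summary refers to can be pictured with a minimal sketch (not the authors' code; the toy corpus, merge count, and function names are illustrative): each "depth" step merges the most frequent adjacent token pair, and the resulting rank-frequency list can then be compared against a Zipfian power law f(r) ∝ r^(−α).

```python
# Illustrative sketch only: recursive BPE merges on a toy corpus, followed by
# a rank-frequency readout for comparison with a Zipfian power law.
from collections import Counter

def bpe_merge_step(tokens):
    """Merge the most frequent adjacent token pair into a single new token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # replace the pair with its concatenation
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def rank_frequencies(tokens):
    """Token counts sorted by rank, for inspection against f(r) ~ r**(-alpha)."""
    return [count for _, count in Counter(tokens).most_common()]

corpus = "the cat sat on the mat and the cat ate the rat"
tokens = list(corpus)            # start from character-level (byte-like) tokens
for depth in range(20):          # "BPE depth" = number of recursive merge steps
    tokens = bpe_merge_step(tokens)

print(rank_frequencies(tokens)[:10])
```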

📝 Abstract
Natural languages exhibit striking regularities in their statistical structure, most notably the emergence of Zipf's and Heaps' laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte-pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context-dependent token probability distributions, we train language models on corpora tokenised at varying BPE depths, revealing that the model predictive entropies increasingly agree with Zipf-derived predictions as the BPE depth increases. Attention-based diagnostics further indicate that deeper tokenisation reduces local token dependencies, bringing the empirical distribution closer to the weakly dependent (near IID) regime. Together, these results clarify how BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language.
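The paper derives its own closed-form slot entropy expectation; for orientation only, the Shannon entropy of a truncated Zipfian rank-frequency distribution takes the following standard form (the exponent α, vocabulary size N, and notation here are illustrative and need not match the paper's conventions).

```latex
% Shannon entropy of a truncated Zipfian distribution (illustrative notation):
% the token of rank r has probability p_r = r^{-\alpha} / H_{N,\alpha},
% where H_{N,\alpha} = \sum_{s=1}^{N} s^{-\alpha} is a generalised harmonic number.
\[
  H \;=\; -\sum_{r=1}^{N} p_r \log_2 p_r
    \;=\; \log_2 H_{N,\alpha}
      \;+\; \frac{\alpha}{H_{N,\alpha}\,\ln 2} \sum_{r=1}^{N} \frac{\ln r}{r^{\alpha}} .
\]
```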
Problem

Research questions and friction points this paper is trying to address.

tokenisation
Zipf's law
Shannon entropy
byte-pair encoding
statistical structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byte-Pair Encoding
Zipf's Law
Shannon Entropy
Transformer Models
Tokenisation
David S. Berman
Centre for Theoretical Physics, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom
Alexander G. Stapleton
PhD Student, Queen Mary University of London
hep-th, neural networks, conformal field theories