Entropy and type-token ratio in gigaword corpora

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the quantitative relationship between two measures of lexical diversity, word entropy and the type-token ratio (TTR), across large-scale corpora (≈1 billion tokens each) of English, Spanish, and Turkish, spanning books, news, and tweets. Using word-frequency distributions, information-theoretic entropy computation, and power-law fitting, the authors empirically establish, across multiple languages and genres at scale, a highly consistent negative functional relationship between word entropy and TTR (R² > 0.99). By integrating Zipf's law and Heaps' law, they derive an analytical expression for this relationship in the asymptotic limit of large texts. The resulting function is cross-linguistically invariant, pointing to a universal scaling law governing lexical diversity. This work provides a unified theoretical framework for modeling linguistic diversity and a mathematically grounded metric foundation for quantitative language analysis.
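Both metrics are standard and easy to compute from a word-frequency table. The sketch below is our illustration, not the authors' code: it computes word entropy (in bits) and the type-token ratio for a list of tokens.

```python
from collections import Counter
import math

def entropy_and_ttr(tokens):
    """Word entropy H = -sum_w p_w * log2(p_w) and type-token ratio V/N,
    where V is the number of distinct word types and N the token count."""
    counts = Counter(tokens)
    n = len(tokens)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    ttr = len(counts) / n
    return h, ttr

# Toy example: 6 tokens, 5 types ("the" repeats)
h, ttr = entropy_and_ttr("the cat sat on the mat".split())
```

On real gigaword corpora one would stream counts rather than hold all tokens in memory, but the two formulas are unchanged.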

📝 Abstract
There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between entropy and type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths we find an analytical expression for this relation relying on both Zipf and Heaps laws that agrees with our empirical findings.
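The two statistical laws invoked in the abstract can be stated compactly; the notation below is generic and not necessarily the paper's:

```latex
% Zipf's law: rank-frequency distribution of word tokens
f(r) \propto r^{-\alpha}
% Heaps' law: vocabulary size V grows sublinearly with text length N
V(N) \sim C\, N^{\beta}, \qquad 0 < \beta < 1
% hence the type-token ratio decays with text length
\mathrm{TTR}(N) = \frac{V(N)}{N} \sim C\, N^{\beta - 1}
% while the word entropy is
H = -\sum_{w} p_w \log_2 p_w
```

Since both H and TTR depend on the text length N under these laws, eliminating N yields a functional relation between the two metrics; an expression of this kind is what the paper derives in the large-text limit.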
Problem

Research questions and friction points this paper is trying to address.

How can lexical diversity be measured consistently across languages?
How are word entropy and the type-token ratio related?
Do these relations hold at gigaword scale across registers and genres?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint analysis of two lexical diversity metrics (entropy and TTR) in six gigaword corpora
Empirical functional relation between entropy and TTR, consistent across languages and genres
Analytical expression for this relation derived from Zipf's and Heaps' laws in the large-text limit
P. Rosillo-Rodes
Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain
Maxi San Miguel
Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain
David Sanchez
Serra Hunter Professor and ICREA-Acadèmia Researcher at Universitat Rovira i Virgili (URV)
Topics: Semantics, Data privacy, Machine learning