🤖 AI Summary
This study investigates whether natural language exhibits cross-scale statistical regularities, in particular turbulence-like spectral scaling, in the embedding space of Transformer models. Treating text as a high-dimensional trajectory in embedding space, the authors quantify scale-dependent fluctuations along token sequences and analyze their power spectra. They report, for the first time, a robust 5/3 power-law spectrum in contextual representations, analogous to Kolmogorov's turbulence spectrum, suggesting that semantic information is integrated across scales in a scale-free, self-similar manner. The scaling is observed consistently across multiple languages and in both human- and AI-generated text, yet vanishes for static embeddings and for shuffled token sequences, underscoring the critical role of dynamic contextual structure in shaping these statistical properties.
📝 Abstract
Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token sequence using an embedding-step signal. Across multiple languages and corpora, the resulting power spectrum exhibits a robust power law with an exponent close to $5/3$ over an extended frequency range. This scaling is observed consistently in contextual embeddings from both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomization of token order. These results show that the observed scaling reflects multiscale, context-dependent organization rather than lexical statistics alone. By analogy with the Kolmogorov spectrum in turbulence, our findings suggest that semantic information is integrated in a scale-free, self-similar manner across linguistic scales, and provide a quantitative, model-agnostic benchmark for studying complex structure in language representations.
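The analysis pipeline the abstract describes (a scalar "embedding-step signal" along the token sequence, its power spectrum, and a power-law fit to the spectrum) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the definition of the step signal as the norm of successive embedding differences, and the plain log-log least-squares fit, are assumptions; the synthetic input is a signal constructed to have an exact $f^{-5/3}$ spectrum, used only to check that the fit recovers a known exponent.

```python
import numpy as np

def step_signal(embeddings):
    # Norm of successive embedding differences along the token axis
    # (one plausible reading of the paper's "embedding-step signal").
    return np.linalg.norm(np.diff(embeddings, axis=0), axis=1)

def power_spectrum(signal):
    # One-sided power spectrum of a 1-D signal; the zero-frequency
    # (mean) bin is dropped before fitting.
    sig = signal - signal.mean()
    freqs = np.fft.rfftfreq(len(sig))
    spectrum = np.abs(np.fft.rfft(sig)) ** 2
    return freqs[1:], spectrum[1:]

def fit_exponent(freqs, spectrum):
    # Least-squares slope in log-log coordinates: S(f) ~ f**slope.
    return np.polyfit(np.log(freqs), np.log(spectrum), 1)[0]

# Sanity check on synthetic data with a known spectrum: build a signal
# whose Fourier amplitudes scale as f**(-5/6), so its power spectrum
# scales as f**(-5/3), and confirm the fit recovers that exponent.
n = 8192
f = np.fft.rfftfreq(n)
phases = np.random.default_rng(0).uniform(0, 2 * np.pi, len(f) - 1)
phases[-1] = 0.0  # Nyquist coefficient of a real signal must be real
coeffs = np.zeros(len(f), dtype=complex)
coeffs[1:] = f[1:] ** (-5 / 6) * np.exp(1j * phases)
synthetic = np.fft.irfft(coeffs, n)
slope = fit_exponent(*power_spectrum(synthetic))  # ≈ -5/3
```

In practice, `step_signal` would be applied to a `(tokens, dim)` matrix of contextual embeddings extracted from a Transformer layer; the paper's shuffled-sequence control corresponds to permuting the token order before extracting embeddings and checking that the fitted exponent no longer appears.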