Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models
By: Vladimir Berman
Potential Business Impact:
Explains why some words appear often and others rarely, even in text generated at random, and what that means for language models.
We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.
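The abstract's model is simple enough to simulate directly. The sketch below is a rough numerical check of two of the stated results, not code from the paper: it draws i.i.d. symbols from an alphabet plus a space, segments them into words, and compares (a) the word-length distribution against the geometric law and (b) the rank-frequency structure against a Zipf-type power law. The parameter names and values (M, q, N) are illustrative choices, and the exponent formula used at the end is the standard random-typing derivation, which is assumed here to agree with the paper's closed form.

```python
"""
Minimal simulation of the random-text model described in the abstract:
i.i.d. draws from M equiprobable letters plus one space symbol (probability q),
with words defined as maximal blocks of non-space symbols. Parameter values
and variable names are illustrative, not taken from the paper.
"""
import math
import random
from collections import Counter, defaultdict

M = 26          # alphabet size (letters)
q = 0.2         # probability of drawing the space symbol
N = 2_000_000   # number of i.i.d. symbol draws

rng = random.Random(0)
letters = [chr(ord("a") + i) for i in range(M)]
symbols = "".join(" " if rng.random() < q else rng.choice(letters) for _ in range(N))
words = symbols.split()   # maximal non-space blocks
n_words = len(words)

# 1) Word lengths should be geometric: P(len = k) = (1 - q)^(k - 1) * q.
length_counts = Counter(len(w) for w in words)
print("k   empirical P(len=k)   geometric prediction")
for k in range(1, 7):
    emp = length_counts.get(k, 0) / n_words
    pred = (1 - q) ** (k - 1) * q
    print(f"{k}   {emp:.4f}               {pred:.4f}")

# 2) Zipf-type rank-frequency structure. All length-k word types share the same
#    expected frequency, which decays exponentially in k, while the number of
#    length-k types grows exponentially in k. Tracking mean type frequency
#    against cumulative rank across lengths traces out a power law.
type_counts = Counter(words)
counts_by_len = defaultdict(list)
for w, c in type_counts.items():
    counts_by_len[len(w)].append(c)

points = []           # (log cumulative rank, log mean type frequency) per length
cumulative_rank = 0
for k in sorted(counts_by_len):
    if k > 3:         # longer words are too rare at this sample size
        break
    counts_k = counts_by_len[k]
    cumulative_rank += len(counts_k)
    mean_freq = sum(counts_k) / (len(counts_k) * n_words)
    points.append((math.log(cumulative_rank), math.log(mean_freq)))

for (x1, y1), (x2, y2) in zip(points, points[1:]):
    print("local Zipf slope between consecutive lengths:",
          round(-(y2 - y1) / (x2 - x1), 3))

# Exponent from the standard random-typing derivation (assumed, not quoted
# from the paper): alpha = 1 - ln(1 - q) / ln(M).
alpha = 1 - math.log(1 - q) / math.log(M)
print("predicted exponent alpha =", round(alpha, 3))
```

With these illustrative parameters the empirical word-length distribution tracks the geometric prediction closely, and the local slopes fall near the predicted exponent (slightly above 1), consistent with the abstract's claim that Zipf-like structure emerges from combinatorics and segmentation alone.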
Similar Papers
The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework
Methodology
Explains how word parts make words and their patterns.
Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
Methodology
Explains why some words are common, others rare.
From Zipf's Law to Neural Scaling through Heaps' Law and Hilberg's Hypothesis
Information Theory
Makes AI understand language better by finding patterns.