Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
By: Vladimir Berman
Potential Business Impact:
Explains why some words are common, others rare.
The origin of Zipf's law in language remains debated across several fields, with no definitive explanation. This study accounts for Zipf-like behavior through a purely geometric mechanism, without invoking any linguistic elements. The Full Combinatorial Word Model (FCWM) assembles words from a finite alphabet plus a blank (word-separator) symbol, producing a geometric distribution of word lengths. Two competing exponential factors, the geometric decay of word-length probabilities and the exponential growth in the number of possible words of each length, combine to yield a power-law rank-frequency curve whose exponent is determined by the alphabet size and the blank-symbol probability. Simulations support the analytical predictions and match rank-frequency data from English, Russian, and mixed-genre corpora. The symbolic model suggests that Zipf-type laws arise from geometric constraints rather than from communicative efficiency.
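The sketch below illustrates the kind of two-stage symbolic process the abstract describes, under the standard random-typing interpretation: symbols are drawn independently from a finite alphabet plus a blank, and the blank splits the stream into words. The parameter names (alphabet_size, blank_prob) and the 26-letter, 0.18 settings are illustrative assumptions, not values taken from the paper, and the exact FCWM construction may differ.

```python
# Minimal sketch, assuming the random-typing reading of the two-stage process:
# stage 1 emits an i.i.d. symbol stream, stage 2 cuts it into words at blanks.
# Names and parameter values here are illustrative, not from the paper.
import math
import random
from collections import Counter

def generate_words(n_symbols: int, alphabet_size: int, blank_prob: float, seed: int = 0):
    """Emit n_symbols random symbols, then split the stream into words at blanks."""
    rng = random.Random(seed)
    alphabet = [chr(ord('a') + i) for i in range(alphabet_size)]
    stream = []
    for _ in range(n_symbols):
        if rng.random() < blank_prob:
            stream.append(' ')            # blank ends the current word
        else:
            stream.append(rng.choice(alphabet))
    return [w for w in ''.join(stream).split(' ') if w]

def rank_frequency(words):
    """Return word frequencies sorted in descending order (rank 1 = most frequent)."""
    return sorted(Counter(words).values(), reverse=True)

if __name__ == '__main__':
    A, p = 26, 0.18                       # assumed alphabet size and blank probability
    freqs = rank_frequency(generate_words(2_000_000, A, p))

    # Predicted exponent for this mechanism: word-length probabilities decay
    # geometrically while the number of possible words grows exponentially with
    # length, so frequency ~ rank^(-alpha) with alpha = 1 - ln(1 - p) / ln(A).
    alpha_pred = 1.0 - math.log(1.0 - p) / math.log(A)

    # Crude empirical slope from two points on the log-log rank-frequency curve.
    r1, r2 = 10, 1000
    alpha_emp = -(math.log(freqs[r2 - 1]) - math.log(freqs[r1 - 1])) / (math.log(r2) - math.log(r1))
    print(f"predicted alpha ~ {alpha_pred:.2f}, empirical slope ~ {alpha_emp:.2f}")
```

With the assumed settings, the predicted exponent is close to 1, in line with the near-Zipf slopes the abstract reports; the empirical estimate is only a rough check, since the rank-frequency curve of such a process is step-like rather than perfectly straight.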
Similar Papers
The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework
Methodology
Explains how word parts make words and their patterns.
Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models
Computation and Language
Explains why words appear often or rarely.
From Zipf's Law to Neural Scaling through Heaps' Law and Hilberg's Hypothesis
Information Theory
Makes AI understand language better by finding patterns.