Quadratic Term Correction on Heaps' Law
By: Oscar Fontanelli, Wentian Li
Potential Business Impact:
Makes computer language models understand words better.
Heaps' law (also called Herdan's law) characterizes the relation between word types and word tokens by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that the type-token curve is still slightly concave even in log-log scale, which invalidates the power-law relation. At the next order of approximation, we show, using twenty English novels and other writings (some translated into English from another language), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently yield a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02. Using the model of "randomly drawing colored balls from a bag with replacement", we show that the curvature in log-log scale is identical to a "pseudo-variance", which is negative. Although the pseudo-variance calculation may encounter numerical instability when the number of tokens is large, owing to the large values of the pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
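The regression described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' code. The function names (`type_token_curve`, `fit_log_log_quadratic`, `expected_types`, `zipf_probabilities`) are hypothetical, and the Zipf-like bag of 1000 colors is an assumption chosen only for illustration. The expected number of distinct colors after N draws with replacement, the sum over colors of 1 - (1 - p_i)^N, is a standard result and is used here in place of the paper's pseudo-variance formalism, which the abstract does not spell out.

```python
import math


def type_token_curve(tokens):
    """Return (tokens_seen, types_seen) after each token of the sequence."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def _solve_3x3(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (M[i][3] - s) / M[i][i]
    return x


def fit_log_log_quadratic(curve):
    """Least-squares fit of ln(types) = a + b*ln(tokens) + c*ln(tokens)**2.

    Returns (a, b, c). The second derivative of the fitted curve in
    log-log scale is 2*c, so c < 0 means the curve is concave.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    # Normal equations built from power sums of x and of y * x**k.
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[S[i + j] for j in range(3)] for i in range(3)]
    return tuple(_solve_3x3(A, T))


def zipf_probabilities(n_types):
    """Assumed bag composition: p_i proportional to 1/i (illustrative only)."""
    weights = [1.0 / i for i in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types(probs, n_tokens):
    """Expected number of distinct colors after n_tokens draws with replacement."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


if __name__ == "__main__":
    probs = zipf_probabilities(1000)
    curve = [(n, expected_types(probs, n)) for n in (1, 10, 100, 1000, 10000)]
    a, b, c = fit_log_log_quadratic(curve)
    print(f"a={a:.3f}  b={b:.3f}  c={c:.3f}")
```

For a real text, `type_token_curve(text.split())` would supply the points to fit. Whitespace tokenization is itself a simplification, and subsampling the curve at roughly log-spaced token counts keeps the early points from being swamped by the many late ones.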
Similar Papers
From Zipf's Law to Neural Scaling through Heaps' Law and Hilberg's Hypothesis
Information Theory
Makes AI understand language better by finding patterns.
Complete asymptotic type-token relationship for growing complex systems with inverse power-law count rankings
Physics and Society
Explains how word counts predict new words.
Learning curves theory for hierarchically compositional data with power-law distributed features
Machine Learning (Stat)
Makes AI learn faster by understanding how things are built.