The Token Tax: Systematic Bias in Multilingual Tokenization
By: Jessica M. Lundin, Ada Zhang, Nihal Karim, and more
Potential Business Impact:
Helps computers better understand languages with many word parts.
Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute costs and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens per word) reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation into economic terms, we show that a doubling of tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
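To make the abstract's two central quantities concrete, here is a minimal sketch of the fertility metric (tokens per word) and of the stated cost relationship. It assumes the Hugging Face transformers AutoTokenizer API, the gpt2 tokenizer, and whitespace word splitting; none of these choices are specified by the paper, and the example sentences are illustrative only.

```python
# Minimal sketch of fertility (tokens/word), assuming the Hugging Face
# transformers library and whitespace word splitting. The tokenizer choice
# ("gpt2") and the example sentences are illustrative assumptions, not the
# authors' exact setup.
from transformers import AutoTokenizer


def fertility(text: str, tokenizer) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / max(len(words), 1)


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer, for illustration

    english = "The quick brown fox jumps over the lazy dog"
    swahili = "Mbweha mwepesi wa kahawia anaruka juu ya mbwa mvivu"  # illustrative sample

    f_en = fertility(english, tok)
    f_sw = fertility(swahili, tok)
    print(f"fertility (en): {f_en:.2f} tokens/word")
    print(f"fertility (sw): {f_sw:.2f} tokens/word")

    # Under the abstract's stated relationship (doubling tokens quadruples
    # training cost and time), relative cost grows with the square of the
    # fertility ratio for comparable content.
    print(f"relative training-cost factor: {(f_sw / f_en) ** 2:.2f}x")
```

Higher-fertility languages pay twice under this relationship: each word becomes more tokens, and the cost of those extra tokens compounds quadratically, which is the token tax the paper quantifies.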
Similar Papers
Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Computation and Language
Makes AI work fairly for all languages.
Contextual morphologically-guided tokenization for Latin encoder models
Computation and Language
Helps computers understand old languages better.
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Computation and Language
Makes computers understand many languages faster.