Training Language Models with Homotokens Leads to Delayed Overfitting
By: Adrian Cosma, Stefan Ruseti, Emilian Radoi, and more
Potential Business Impact:
Makes AI understand words better, even when the same word is split into pieces differently.
Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
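To make the idea of homotokens concrete, the following is a minimal sketch (not the authors' implementation) of how alternative segmentations of a single word could be enumerated and sampled alongside a greedy longest-prefix canonical tokenization. The function names and the toy vocabulary are illustrative assumptions; a real setup would operate on a trained subword vocabulary such as a BPE merge table.

```python
# Hedged sketch: enumerate and sample "homotokens" for one word.
# The vocabulary below is a toy assumption, not the paper's tokenizer.
import random


def all_segmentations(word: str, vocab: set[str]) -> list[list[str]]:
    """Return every way to split `word` into in-vocabulary subword pieces."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in all_segmentations(word[i:], vocab):
                results.append([piece] + rest)
    return results


def canonical_longest_prefix(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-prefix tokenization, used here as the canonical split."""
    pieces = []
    while word:
        for i in range(len(word), 0, -1):
            if word[:i] in vocab:
                pieces.append(word[:i])
                word = word[i:]
                break
        else:
            raise ValueError("word cannot be segmented with this vocabulary")
    return pieces


def sample_homotoken(word: str, vocab: set[str]) -> list[str]:
    """Sample a valid non-canonical segmentation, if one exists."""
    canonical = canonical_longest_prefix(word, vocab)
    variants = [s for s in all_segmentations(word, vocab) if s != canonical]
    return random.choice(variants) if variants else canonical


vocab = {"token", "tok", "en", "ization", "iza", "tion",
         "t", "o", "k", "e", "n", "i", "z", "a"}
print(canonical_longest_prefix("tokenization", vocab))  # ['token', 'ization']
print(sample_homotoken("tokenization", vocab))          # e.g. ['tok', 'en', 'iza', 'tion']
```

Every sampled variant decodes to the same surface string as the canonical segmentation, which is what makes the augmentation strictly meaning-preserving rather than a change to the training data itself.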
Similar Papers
The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models
Computation and Language
Teaches computers to understand and create new medicines.
Contextual morphologically-guided tokenization for Latin encoder models
Computation and Language
Helps computers understand old languages better.
Token Homogenization under Positional Bias
Computation and Language
Makes AI understand words better by fixing how it sees them.