Self-Vocabularizing Training for Neural Machine Translation
By: Pin-Jie Lin, Ernie Chang, Yangyang Shi, and more
Potential Business Impact:
Teaches computers to learn words better for translation.
Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, and that retraining with this induced vocabulary improves performance. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training, where each iteration generates a labeled dataset by pairing source sentences with the model's predictions and uses it to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6-8% reduction in vocabulary size.
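To make the loop concrete, here is a minimal Python sketch of the iterative procedure the abstract describes, built around a toy BPE learner. Everything here is illustrative rather than the paper's actual implementation: learn_bpe_vocab, train_fn, and translate_fn are hypothetical names, the BPE learner is a textbook character-merge version, and the "model" in the usage example is a stand-in that memorizes training pairs. The key structure is that each round learns a vocabulary from the current target-side text, retrains, and then replaces the target side with the model's own predictions, so the next vocabulary is induced from what the model actually generates.

```python
from collections import Counter

def learn_bpe_vocab(corpus, num_merges):
    """Learn BPE merges from a list of sentences; return (merges, symbol vocabulary)."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    words = Counter()
    for sent in corpus:
        for w in sent.split():
            words[tuple(w) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    vocab = {sym for word in words for sym in word}
    return merges, vocab

def self_vocabularizing_training(src, refs, train_fn, translate_fn,
                                 rounds=3, num_merges=50):
    """Sketch of the loop: learn vocab -> train -> predict -> relearn vocab."""
    target_side = refs  # the first round learns from the reference translations
    model, vocab = None, None
    for _ in range(rounds):
        merges, vocab = learn_bpe_vocab(target_side, num_merges)
        model = train_fn(src, target_side, merges)            # retrain with induced vocab
        target_side = [translate_fn(model, s) for s in src]   # next vocab comes from predictions
    return model, vocab

if __name__ == "__main__":
    src = ["das haus ist gross", "die katze schlaeft"]
    refs = ["the house is big", "the cat sleeps"]
    # Stand-in "model": memorizes the training pairs; a real NMT model goes here.
    train_fn = lambda s, t, merges: dict(zip(s, t))
    translate_fn = lambda model, sent: model[sent]
    model, vocab = self_vocabularizing_training(src, refs, train_fn, translate_fn)
    print(sorted(vocab))
```

In this sketch the vocabulary converges immediately because the stand-in model reproduces its training data; with a real NMT model, the predictions would use only a subset of the original BPE vocabulary, which is the shrinking effect the paper exploits.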
Similar Papers
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Computation and Language
Lets computers understand text at different sizes.
Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
Computation and Language
Teaches computers new languages using dictionaries.
Overcoming Vocabulary Constraints with Pixel-level Fallback
Computation and Language
Helps computers understand any language, even new ones.