Which Pieces Does Unigram Tokenization Really Need?
By: Sander Land, Yuval Pinter
Potential Business Impact:
Makes text tokenization for language models simpler to implement and better at compressing text.
The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.
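To make the contrast with greedy merging concrete, here is a minimal sketch of how a unigram model can segment a string: each vocabulary piece has a probability, and the chosen segmentation is the one whose pieces have the highest product of probabilities, found by dynamic programming (Viterbi). This is a generic illustration and not the implementation or the simplified algorithm described in the paper; the function name `viterbi_segment` and the toy vocabulary with its probabilities are hypothetical.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (pieces, score): the most probable segmentation of `text`
    under a unigram model, where `log_probs` maps piece -> log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical toy vocabulary; the probabilities are made up and sum to 1.
toy_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
             "hu": 0.15, "ug": 0.20, "hug": 0.25, "gs": 0.20}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("hugs", toy_log_probs)
```

In this toy setup the longest available piece, `hug`, is not part of the best segmentation: `hu` + `gs` has probability 0.15 × 0.20 = 0.03, which beats `hug` + `s` at 0.25 × 0.05 = 0.0125. The segmentation is chosen by total probability rather than by a fixed order of merges.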
Similar Papers
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Computation and Language
Makes computer language tools fair for all languages.
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Computation and Language
Helps computers understand languages better by breaking words.