Which Pieces Does Unigram Tokenization Really Need?

Published: December 14, 2025 | arXiv ID: 2512.12641v1

By: Sander Land, Yuval Pinter

Potential Business Impact:

Simplifies how Unigram tokenizers, which split text into pieces for language models, are trained, and improves how compactly they encode text.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.
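
For readers unfamiliar with the model, the following is a minimal sketch of the inference step of a unigram tokenizer: given a fixed vocabulary with token probabilities, pick the segmentation whose tokens have the highest product of probabilities, using dynamic programming (Viterbi). It is not taken from the paper and does not cover vocabulary training or pruning; the function name `viterbi_segment` and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (tokens, score): the most probable segmentation of `text`
    under a unigram model, where `log_probs` maps token -> log-probability."""
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1], best[n]


# Toy vocabulary (assumed values, not from the paper); probabilities sum to 1.
toy_probs = {
    "u": 0.1, "n": 0.1, "i": 0.1, "a": 0.1, "m": 0.1,
    "g": 0.05, "r": 0.05,
    "un": 0.1, "uni": 0.1, "gram": 0.2,
}
toy_log_probs = {tok: math.log(p) for tok, p in toy_probs.items()}

tokens, score = viterbi_segment("unigram", toy_log_probs)
print(tokens, score)
```

With this toy vocabulary the two-token split `uni` + `gram` (probability 0.1 × 0.2) beats every alternative, such as `un` + `i` + `gram` (0.1 × 0.1 × 0.2). Training a Unigram vocabulary, the part the paper addresses, involves estimating these probabilities and pruning tokens, which this sketch leaves out.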

Country of Origin
🇮🇱 Israel

Page Count
10 pages

Category
Computer Science: Computation and Language