Which Pieces Does Unigram Tokenization Really Need?

Published: December 14, 2025 | arXiv ID: 2512.12641v1

By: Sander Land, Yuval Pinter

Potential Business Impact:

Makes computer language simpler and faster.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Computation and Language

Makes computer language tools fair for all languages.

6 Aug 2025

87%

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Computation and Language

Helps computers understand languages better by breaking words.

11 Aug 2025

87%

Country of Origin

🇮🇱 Israel

Repos / Data Links

github.com

Page Count

10 pages
