MorphTok: Morphologically Grounded Tokenization for Indian Languages

Published: April 14, 2025 | arXiv ID: 2504.10335v1

By: Maharaj Brahma , N J Karthika , Atul Singh and more

Potential Business Impact:

Helps computers understand languages better.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, EvalTok, enabling more human-grounded assessment.
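To make the abstract's terms concrete, here is a minimal Python sketch of greedy BPE merging, one possible script-specific constraint, and a fertility computation. It is not the authors' implementation: the abstract does not spell out the exact CBPE rules, so the sketch assumes the constraint is that a dependent vowel sign is glued to the preceding character before any merges are learned, and it assumes fertility means the average number of tokens per word. All function names (`initial_symbols`, `learn_bpe`, `tokenize`, `fertility`) are hypothetical.

```python
import unicodedata
from collections import Counter


def is_dependent_sign(ch):
    # Assumption: Unicode combining marks (categories Mn/Mc) stand in for
    # dependent vowel signs; this also catches virama and anusvara.
    return unicodedata.category(ch) in ("Mn", "Mc")


def initial_symbols(word, constrained):
    """Split a word into the starting symbols that BPE will merge."""
    if not constrained:
        return list(word)
    symbols = []
    for ch in word:
        if symbols and is_dependent_sign(ch):
            symbols[-1] += ch  # attach the sign to the preceding symbol
        else:
            symbols.append(ch)
    return symbols


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges, constrained=False):
    """Greedily learn up to `num_merges` merges of the most frequent pair."""
    corpus = Counter(tuple(initial_symbols(w, constrained)) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[tuple(apply_merge(list(symbols), best))] += freq
        corpus = merged
    return merges


def tokenize(word, merges, constrained=False):
    """Apply learned merges, in order, to a single word."""
    symbols = initial_symbols(word, constrained)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def fertility(words, merges, constrained=False):
    """Average number of tokens per word (assumed definition)."""
    return sum(len(tokenize(w, merges, constrained)) for w in words) / len(words)
```

With no merges learned, `initial_symbols("किताब", constrained=True)` yields three symbols (`कि`, `ता`, `ब`) where the unconstrained split yields five code points, which illustrates why a constraint of this kind can lower fertility. The 1.68% figure in the abstract comes from the authors' experiments and is not something this toy sketch reproduces.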

Tokenization Matters: Improving Zero-Shot NER for Indic Languages

Computation and Language

Helps computers understand Indian languages better.

23 Apr 2025

91%

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Computation and Language

Helps computers understand languages better by breaking words.

11 Aug 2025

91%

Page Count

13 pages
