Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
By: Jinfan Frank Hu
Potential Business Impact:
Helps computers understand languages like Turkish and Finnish, where little training data exists.
Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates how four tokenization strategies, namely word-level, character-level, n-gram, and Byte Pair Encoding (BPE), affect the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed every subword alternative tested. These findings suggest that in agglutinative, low-resource contexts, preserving whole-word boundaries may yield better embedding performance than statistical subword segmentation. This has practical implications for developing NLP pipelines for under-resourced languages, where annotated data and computing power are limited.
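A minimal sketch of how two of the compared pipelines might be set up, assuming the gensim and Hugging Face tokenizers libraries; the toy corpus, vocabulary size, and hyperparameters below are illustrative assumptions, not the paper's actual settings:

    # Hypothetical sketch: word-level vs. BPE tokenization feeding Word2Vec.
    # Corpus, vocab_size, and hyperparameters are illustrative, not the study's.
    from gensim.models import Word2Vec
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # Stand-in for the Wikipedia articles (Turkish examples here).
    articles = ["Ankara Türkiye'nin başkentidir.", "Evlerinizden geliyorlar."]

    # Word-level tokenization: split on whitespace, preserving word boundaries.
    word_corpus = [doc.lower().split() for doc in articles]

    # BPE tokenization: learn statistical subword merges from the same corpus.
    bpe = Tokenizer(BPE(unk_token="[UNK]"))
    bpe.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
    bpe.train_from_iterator(articles, trainer)
    bpe_corpus = [bpe.encode(doc.lower()).tokens for doc in articles]

    # Train one Word2Vec model per strategy under identical settings,
    # so only the tokenization varies between models.
    models = {
        name: Word2Vec(sentences=corpus, vector_size=100, window=5,
                       min_count=1, sg=1, epochs=5)
        for name, corpus in [("word", word_corpus), ("bpe", bpe_corpus)]
    }

Each resulting model's embeddings can then be fed into the same downstream NER classifier, isolating the tokenization strategy as the only experimental variable.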
Similar Papers
Subword Tokenization Strategies for Kurdish Word Embeddings
Computation and Language
Helps computers understand Kurdish words better.
Comparative analysis of subword tokenization approaches for Indian languages
Computation and Language
Helps computers translate Indian languages better.
Tokenization Matters: Improving Zero-Shot NER for Indic Languages
Computation and Language
Helps computers understand Indian languages better.