Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
By: Jinfan Frank Hu
Potential Business Impact:
Helps computers understand languages like Turkish and Finnish, where little training data exists.
Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates how four tokenization strategies, namely word-level, character-level, n-gram, and Byte Pair Encoding (BPE), affect the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed every subword alternative tested. These findings suggest that in agglutinative, low-resource contexts, preserving whole-word boundaries may yield better embedding performance than statistical subword segmentation. This has practical implications for developing NLP pipelines for under-resourced languages, where annotated data and computing power are limited.
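A minimal sketch of how two of the compared pipelines might be set up, assuming the gensim and Hugging Face tokenizers libraries; the toy corpus, vocabulary size, and hyperparameters below are illustrative assumptions, not the paper's actual settings:

    # Hypothetical sketch: word-level vs. BPE tokenization feeding Word2Vec.
    # Corpus, vocab_size, and hyperparameters are illustrative, not the study's.
    from gensim.models import Word2Vec
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # Stand-in for the Wikipedia articles (Turkish examples here).
    articles = ["Ankara Türkiye'nin başkentidir.", "Evlerinizden geliyorlar."]

    # Word-level tokenization: split on whitespace, preserving word boundaries.
    word_corpus = [doc.lower().split() for doc in articles]

    # BPE tokenization: learn statistical subword merges from the same corpus.
    bpe = Tokenizer(BPE(unk_token="[UNK]"))
    bpe.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
    bpe.train_from_iterator(articles, trainer)
    bpe_corpus = [bpe.encode(doc.lower()).tokens for doc in articles]

    # Train one Word2Vec model per strategy under identical settings,
    # so only the tokenization varies between models.
    models = {
        name: Word2Vec(sentences=corpus, vector_size=100, window=5,
                       min_count=1, sg=1, epochs=5)
        for name, corpus in [("word", word_corpus), ("bpe", bpe_corpus)]
    }

Each resulting model's embeddings can then be fed into the same downstream NER classifier, isolating the tokenization strategy as the only experimental variable.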
Similar Papers
Subword Tokenization Strategies for Kurdish Word Embeddings
Computation and Language
Helps computers understand Kurdish words better.
Comparative analysis of subword tokenization approaches for Indian languages
Computation and Language
Helps computers translate Indian languages better.
Tokenization Matters: Improving Zero-Shot NER for Indic Languages
Computation and Language
Helps computers understand Indian languages better.