Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
By: Haeji Jung, Jinju Kim, Kyungjin Kim, and more
Potential Business Impact:
Helps computers understand different languages better.
Transliteration has emerged as a way to bridge the gap between languages in multilingual NLP, showing especially promising results for languages that use non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments with four input types: three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms the other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributes to this success and suggest that having longer (subword) tokens shared with the pre-trained languages leads to better utilization of the model.
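To make the four input types concrete, here is a minimal sketch (not the authors' code) of how a single Devanagari word could be rendered as orthography, romanization, phonemic transcription, and a substitution cipher. The character maps below are toy, illustrative assumptions; the paper's experiments would rely on standard romanization and grapheme-to-phoneme tools rather than these hand-written tables.

```python
# Illustrative sketch only: toy versions of the four input types compared
# in the paper (orthography, romanization, phonemic transcription,
# substitution cipher). The maps below are hypothetical and cover just
# the characters of the example word.

# Toy Devanagari-to-Latin romanization map (assumption, for illustration).
ROMAN_MAP = {"भ": "bha", "ा": "a", "र": "ra", "त": "ta"}

# Toy grapheme-to-IPA map standing in for phonemic transcription (assumption).
PHONEME_MAP = {"भ": "bʱ", "ा": "aː", "र": "r", "त": "t̪"}

# A fixed letter-for-letter substitution over the romanized text: it keeps
# the Latin script and token-length statistics but removes any lexical
# overlap with the model's pre-trained languages.
CIPHER = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                       "qwertyuiopasdfghjklzxcvbnm")


def romanize(word: str) -> str:
    """Map each character to a Latin approximation (toy romanizer)."""
    return "".join(ROMAN_MAP.get(ch, ch) for ch in word)


def phonemize(word: str) -> str:
    """Map each character to an IPA symbol (toy phonemic transcription)."""
    return "".join(PHONEME_MAP.get(ch, ch) for ch in word)


def cipher(word: str) -> str:
    """Apply the fixed substitution cipher to the romanized form."""
    return romanize(word).translate(CIPHER)


if __name__ == "__main__":
    word = "भारत"  # "Bharat" (India) in Devanagari
    print("orthography:           ", word)
    print("romanization:          ", romanize(word))   # "bhaarata"
    print("phonemic transcription:", phonemize(word))  # "bʱaːrt̪" (toy output)
    print("substitution cipher:   ", cipher(word))     # ciphered Latin string
```

Under this reading, the cipher condition roughly isolates the effect of shared script from that of shared vocabulary: the input looks like Latin-script text to the tokenizer, but none of its subword tokens coincide with those seen during pre-training, which is presumably why romanization, which preserves such overlap, fares better.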
Similar Papers
Lost in Transliteration: Bridging the Script Gap in Neural IR
Information Retrieval
Helps search engines understand typed foreign words.
Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration
Computation and Language
Helps computers understand different languages written in English letters.
IndoNLP 2025: Shared Task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages
Computation and Language
Helps type Indian languages correctly on phones.