Score: 1

Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

Published: October 10, 2025 | arXiv ID: 2510.09032v1

By: Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, and more

Potential Business Impact:

Helps computers better understand Chakma, a rare, low-resource language.

Business Areas:
Translation Services, Professional Services

As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and exposes the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be highly effective for transfer learning to the Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
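
The fine-tuning procedure summarized above is the standard masked language modeling objective applied to the transliterated corpus. The sketch below illustrates such a setup with Hugging Face Transformers; it is a minimal illustration under stated assumptions, not the authors' code. The choice of xlm-roberta-base (one of the six encoders named in the abstract), the file name chakma_transliterated.txt, and all hyperparameters are assumptions for demonstration only.

```python
# Minimal MLM fine-tuning sketch (illustrative, not the paper's implementation).
import math

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"  # one of the six encoders evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical file of Bangla-transliterated Chakma text, one sentence per line.
raw = load_dataset("text", data_files={"train": "chakma_transliterated.txt"})
raw = raw["train"].filter(lambda ex: ex["text"].strip())
splits = raw.train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = splits.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% random token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mlm-chakma",          # illustrative output directory
    num_train_epochs=3,               # assumed hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)

trainer.train()
metrics = trainer.evaluate()
# Perplexity is the exponential of the held-out cross-entropy loss,
# the same metric family reported in the abstract.
print("perplexity:", math.exp(metrics["eval_loss"]))
```

Token accuracy over masked positions can be computed separately from the model's predictions on the evaluation split; the perplexity printed here corresponds to the kind of held-out perplexity figure quoted in the abstract.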

Country of Origin
🇧🇩 🇺🇸 Bangladesh, United States

Page Count
12 pages

Category
Computer Science:
Computation and Language