MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
By: Sebastian Nehrdich, Kurt Keutzer
Ancient Buddhist literature features frequent, yet often unannotated, textual parallels spread across diverse languages: Sanskrit, Pāli, Buddhist Chinese, Tibetan, and more. The scale of this material makes manual examination prohibitive. We present the MITRA framework, which consists of a novel pipeline for multilingual parallel passage mining; MITRA-parallel, a large-scale corpus of 1.74 million parallel sentence pairs across Sanskrit, Chinese, and Tibetan; and the domain-specific pretrained language model Gemma 2 MITRA. We present Gemma 2 MITRA-MT, a version of this base model fine-tuned for machine translation, which reaches state-of-the-art performance when translating these languages into English and outperforms even much larger open-source models. We also present Gemma 2 MITRA-E, a semantic embedding model that achieves state-of-the-art performance on a novel, detailed semantic embedding benchmark. We make the parallel dataset, model weights, and semantic similarity benchmark openly available to support both NLP research and philological studies of Buddhist and classical Asian literature.
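As a rough illustration of how a released embedding model like Gemma 2 MITRA-E could support cross-lingual parallel passage mining, the sketch below embeds sentences and ranks candidates by cosine similarity. This is a minimal sketch under assumptions: the model id is a hypothetical placeholder, the mean-pooling strategy is a common default rather than a detail from the paper, and the Sanskrit/Tibetan example pair is only illustrative.

```python
# Minimal sketch of embedding-based parallel passage mining, in the spirit of
# the MITRA pipeline. The model id below is a HYPOTHETICAL placeholder;
# substitute the actual released Gemma 2 MITRA-E weights. Assumes the model
# exposes a standard Hugging Face interface.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "path/to/gemma-2-mitra-e"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences: list[str]) -> torch.Tensor:
    """Mean-pool last hidden states into one L2-normalized vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # (batch, tokens, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)      # masked mean over tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

# Cross-lingual retrieval: for each Sanskrit query, rank Tibetan candidates
# by cosine similarity (dot product of L2-normalized embeddings).
queries = ["evaṃ mayā śrutam"]                  # Sanskrit: "Thus have I heard"
candidates = ["'di skad bdag gis thos pa"]      # Tibetan parallel of the same formula
scores = embed(queries) @ embed(candidates).T   # (num_queries, num_candidates)
best_match = scores.argmax(dim=-1)              # index of the top candidate per query
```

At corpus scale, the same idea would typically be paired with an approximate nearest-neighbor index rather than the brute-force matrix product shown here.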