Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

Published: December 8, 2025 | arXiv ID: 2512.07277v1

By: Srihari Bandarupalli, Bhavana Akkiraju, Charan Devarakonda and more

Potential Business Impact:

Lets computers understand rare languages better.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and by the computational resources that state-of-the-art models require. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic use of unlabeled speech data can bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and combine targeted continual pretraining with morphologically aware tokenization to develop a 300M-parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, indicating instead that data relevance and strategic pretraining are more critical factors in low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.

Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning

Artificial Intelligence

Lets computers understand Arabic speech without human help.

16 Apr 2025 1

91%

Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning

Computation and Language

Helps computers understand many Arabic accents.

12 Aug 2025 1

90%

Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

Computation and Language

Helps computers understand many Indian languages better.

7 Jan 2026 3

Page Count

7 pages
