Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
By: Mingchen Shao, Bingshen Mu, Chengyou Wang, and more
Potential Business Impact:
Helps computers understand Thai speech better.
Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness degrades substantially in low-resource languages such as Thai. This limitation arises from three factors: (1) existing commonly used speech encoders, like the Whisper family, underperform in low-resource languages and lack support for broader spoken language understanding tasks; (2) the ASR-based alignment paradigm requires training the entire SLLM, leading to high computational cost; (3) paired speech-text data in low-resource languages is scarce. To overcome these challenges for Thai, we introduce XLSR-Thai, the first self-supervised learning (SSL) speech encoder for Thai. It is obtained by continually pretraining the standard SSL XLSR model on 36,000 hours of Thai speech data. Furthermore, we propose U-Align, a speech-text alignment method that is more resource-efficient and more effective across tasks than typical ASR-based alignment. Finally, we present Thai-SUP, a pipeline for generating Thai spoken language understanding data from high-resource languages, yielding the first Thai spoken language understanding dataset, with over 1,000 hours of speech. Multiple experiments demonstrate the effectiveness of our methods in building a Thai multitask-understanding SLLM. We open-source XLSR-Thai and Thai-SUP to facilitate future research.
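For a concrete picture of the encoder step, the sketch below shows continued SSL pretraining of an XLS-R checkpoint on Thai audio with the wav2vec 2.0 contrastive objective, using Hugging Face transformers. The starting checkpoint, masking hyperparameters, and training loop are assumptions for illustration; the abstract does not give the actual XLSR-Thai recipe.

```python
# A minimal sketch of continued SSL pretraining, assuming XLSR-Thai starts from
# the public XLS-R checkpoint; hyperparameters here are illustrative only.
import torch
from transformers import Wav2Vec2ForPreTraining, Wav2Vec2FeatureExtractor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-xls-r-300m")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

def pretrain_step(waveforms):
    """One contrastive-pretraining step on a batch of 16 kHz Thai waveforms."""
    inputs = feature_extractor(waveforms, sampling_rate=16_000,
                               return_tensors="pt", padding=True)
    batch_size, raw_len = inputs.input_values.shape
    seq_len = model._get_feat_extract_output_lengths(raw_len).item()

    # Mask a subset of latent frames and sample distractors for the
    # contrastive objective, as in wav2vec 2.0 pretraining.
    mask_time_indices = _compute_mask_indices(
        (batch_size, seq_len), mask_prob=0.65, mask_length=10)
    negatives = _sample_negative_indices(
        (batch_size, seq_len), num_negatives=100,
        mask_time_indices=mask_time_indices)

    outputs = model(
        inputs.input_values,
        mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
        sampled_negative_indices=torch.tensor(negatives, dtype=torch.long),
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

The abstract contrasts U-Align with ASR-based alignment on cost: the latter updates the entire SLLM. It does not describe U-Align's internals, so the next sketch only illustrates the generic resource-saving setup such a method can exploit: freeze the speech encoder and the LLM, and train a small adapter that maps encoder frames into the LLM's embedding space. All module names and dimensions are illustrative, not the paper's design.

```python
# Generic adapter-only alignment setup (an illustration, not U-Align itself):
# the two large components are frozen, so only the adapter is trained.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Maps speech-encoder frames into the LLM embedding space (illustrative)."""
    def __init__(self, encoder_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        # Strided conv downsamples the frame rate before projecting.
        self.downsample = nn.Conv1d(encoder_dim, llm_dim,
                                    kernel_size=stride, stride=stride)
        self.proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, speech_feats):                  # (batch, frames, encoder_dim)
        x = self.downsample(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(torch.relu(x))               # (batch, frames/stride, llm_dim)

def trainable_parameters(encoder, adapter, llm):
    """Freeze the encoder and LLM; only adapter parameters receive gradients."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False
    return [p for p in adapter.parameters() if p.requires_grad]
```

Likewise, the abstract states only that Thai-SUP generates Thai SLU data "from high-resource languages." One common recipe for that kind of transfer, offered here purely as an assumption, is to translate the text side of an existing high-resource SLU corpus and synthesize matching Thai speech with TTS. `translate_to_thai` and `thai_tts` below are hypothetical placeholders, not APIs from the paper.

```python
# Hypothetical translate-then-synthesize pipeline for building Thai SLU data;
# the paper's actual Thai-SUP steps are not described in the abstract.
from dataclasses import dataclass

@dataclass
class SLUExample:
    audio: bytes          # synthesized Thai waveform
    transcript: str       # Thai text
    label: str            # task label (e.g., intent), carried over unchanged

def translate_to_thai(text: str) -> str:
    """Hypothetical MT call (any source-language -> Thai translation model)."""
    raise NotImplementedError

def thai_tts(text: str) -> bytes:
    """Hypothetical Thai text-to-speech call."""
    raise NotImplementedError

def build_thai_slu(source_corpus):
    """source_corpus: iterable of (transcript, label) pairs in a high-resource language."""
    for transcript, label in source_corpus:
        thai_text = translate_to_thai(transcript)     # transfer the text side
        yield SLUExample(audio=thai_tts(thai_text),   # pair it with Thai speech
                         transcript=thai_text,
                         label=label)
```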
Similar Papers
Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages
Audio and Speech Processing
Helps computers understand languages with little training data.
SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings
Computation and Language
Lets computers understand spoken words for many different tasks.
Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Computation and Language
Helps computers understand many languages better.