Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
By: Yangui Fang, Jing Peng, Xu Li, and more
Potential Business Impact:
Teaches computers to understand speech from new domains using only text.
Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.
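The core loop described in the abstract can be sketched in a few lines of PyTorch: update the decoder on unpaired target-domain text with a language-modeling loss, and periodically score a paired source-domain batch to guard speech-text alignment. This is a minimal illustrative sketch, not the paper's implementation; TinyDecoder, the frozen speech_encoder stand-in, the random dummy data, the 5% drift threshold, and the rollback rule are all assumptions made for this example.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 1000, 64

class TinyDecoder(nn.Module):
    """Stand-in for the LLM decoder of a Speech LLM (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, prefix=None):
        x = self.embed(tokens)
        if prefix is not None:                      # prepend projected speech features
            x = torch.cat([prefix, x], dim=1)
        h, _ = self.rnn(x)
        return self.head(h)

decoder = TinyDecoder()
speech_encoder = nn.Linear(80, DIM)                 # frozen stand-in for encoder + projector
for p in speech_encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def lm_loss(logits, tokens):
    # standard next-token prediction loss over the text positions
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))

def alignment_score():
    """Proxy for the real-time evaluation: loss on paired source-domain audio + text."""
    with torch.no_grad():
        feats = torch.randn(4, 10, 80)              # dummy audio features
        text = torch.randint(0, VOCAB, (4, 12))     # dummy paired transcripts
        prefix = speech_encoder(feats)
        logits = decoder(text, prefix=prefix)[:, prefix.size(1):]
        return lm_loss(logits, text).item()

baseline = alignment_score()
snapshot = copy.deepcopy(decoder.state_dict())

for step in range(1, 101):
    # text-only update: unpaired target-domain text, no audio involved
    target_text = torch.randint(0, VOCAB, (8, 16))
    loss = lm_loss(decoder(target_text), target_text)
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 10 == 0:                              # periodic alignment check
        if alignment_score() > 1.05 * baseline:     # alignment drifted: roll back
            decoder.load_state_dict(snapshot)
        else:                                       # alignment preserved: checkpoint
            snapshot = copy.deepcopy(decoder.state_dict())
```

In practice the alignment check would presumably use a real paired dev set and a WER metric rather than this loss proxy, but the control flow is the same: fine-tune on text, watch the paired score, and keep only checkpoints that do not degrade it.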
Similar Papers
Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech Data
Sound
Teaches computers to understand speech better with less data.
Customizing Speech Recognition Model with Large Language Model Feedback
Computation and Language
Helps computers understand rare words in speech.
Self-Improvement for Audio Large Language Model using Unlabeled Speech
Sound
Improves voice AI without needing new recordings.