MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
By: Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
Potential Business Impact:
Teaches computers to understand emotions in voices.
Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain-trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
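To make the fusion step concrete, below is a minimal sketch of how mutual-information-weighted label fusion and the EMA teacher could look in PyTorch. It assumes a BALD-style reading of "mutual-information-based uncertainty" (entropy of the mean prediction minus the mean per-sample entropy) and a softmax over negated MI as the per-utterance teacher weighting; the function names, the `temperature` parameter, and the exact weighting scheme are illustrative assumptions, not the paper's verified formulation.

```python
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    """Shannon entropy of categorical distributions, per row."""
    return -(p * (p + eps).log()).sum(dim=-1)

def mutual_information(samples):
    """BALD-style MI from T stochastic predictions (assumption).

    samples: (T, B, C) tensor of T softmax outputs per utterance.
    Returns the mean distribution (B, C) and MI uncertainty (B,).
    """
    mean_p = samples.mean(dim=0)               # predictive mean E[p_t]
    expected_h = entropy(samples).mean(dim=0)  # E[H(p_t)]
    return mean_p, entropy(mean_p) - expected_h  # H(E[p]) - E[H(p)]

def fuse_labels(lalm_samples, src_samples, temperature=1.0):
    """Fuse LALM and source-classifier predictions, down-weighting
    the more uncertain (higher-MI) teacher for each utterance.
    The softmax-over-negated-MI weighting is an assumption."""
    p_lalm, mi_lalm = mutual_information(lalm_samples)
    p_src, mi_src = mutual_information(src_samples)
    w = F.softmax(-torch.stack([mi_lalm, mi_src]) / temperature, dim=0)
    return w[0].unsqueeze(-1) * p_lalm + w[1].unsqueeze(-1) * p_src

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Exponential-moving-average teacher used to stabilize training."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)
```

The intuition behind this weighting is that a teacher whose stochastic predictions disagree with each other (high MI) is epistemically uncertain on that utterance, so its vote counts for less; the EMA teacher then smooths the resulting pseudo-labels over training steps rather than chasing any single noisy fusion.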
Similar Papers
Self-Improvement for Audio Large Language Model using Unlabeled Speech
Sound
Improves voice AI without needing new recordings.
Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech Data
Sound
Teaches computers to understand speech better with less data.
Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion
Sound
Helps computers understand feelings in voices better.