HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
By: Yuke Si, Runyan Yang, Yingying Gao, and more
Potential Business Impact:
Helps computers understand speech and feelings better.
Recent advances in large language models have facilitated the development of unified speech language models (SLMs) capable of supporting multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on distinct types of information: ASR primarily depends on linguistic content, whereas SER requires the integration of both linguistic and paralinguistic cues. Existing multi-task SLMs typically adopt naive parameter sharing or prompt-based conditioning without explicitly modeling the differences in information composition required by each task. Such designs risk task interference and performance degradation, especially under limited data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. HarmoniFuse is designed to harmonize heterogeneous task demands by selecting and fusing task-relevant components of speech representations. Specifically, it integrates a gated speech encoder to extract task-specific acoustic features and a prompt-adaptive dynamic fusion module to aggregate transformer layers based on task characteristics. In addition, a batch-interleaved training strategy allows the model to leverage separate ASR and SER datasets without requiring joint annotation. Experimental results demonstrate that HarmoniFuse improves both ASR and SER performance, offering a scalable and robust solution for multi-task speech understanding under realistic data constraints.
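The paper provides no reference implementation here, so the sketch below is only an illustration of how the two components named in the abstract, a gated speech encoder and a prompt-adaptive fusion of transformer layers, might be wired together in PyTorch. All class names, tensor shapes, and the two-task setup (ASR vs. SER) are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class GatedSpeechEncoder(nn.Module):
    """Sketch of a gated encoder: a task-conditioned sigmoid gate selects
    which components of a shared acoustic representation pass downstream."""

    def __init__(self, feat_dim: int, num_tasks: int = 2):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, feat_dim)
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); task_id: (batch,)
        task = self.task_embed(task_id).unsqueeze(1).expand_as(feats)
        g = self.gate(torch.cat([feats, task], dim=-1))
        return g * feats  # keep only task-relevant feature components


class PromptAdaptiveLayerFusion(nn.Module):
    """Sketch of prompt-adaptive dynamic fusion: per-task weights aggregate
    the hidden states of all transformer layers into one representation."""

    def __init__(self, num_layers: int, num_tasks: int = 2):
        super().__init__()
        self.layer_logits = nn.Embedding(num_tasks, num_layers)

    def forward(self, layer_states: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, time, dim)
        w = torch.softmax(self.layer_logits(task_id), dim=-1)  # (batch, num_layers)
        return torch.einsum("bl,lbtd->btd", w, layer_states)


if __name__ == "__main__":
    enc = GatedSpeechEncoder(feat_dim=256)
    fuse = PromptAdaptiveLayerFusion(num_layers=12)
    feats = torch.randn(4, 100, 256)        # (batch, time, dim) acoustic features
    task = torch.tensor([0, 1, 0, 1])       # assumed ids: 0 = ASR, 1 = SER
    gated = enc(feats, task)
    states = torch.randn(12, 4, 100, 256)   # hidden states from 12 transformer layers
    fused = fuse(states, task)
    print(gated.shape, fused.shape)
```

Under the batch-interleaved strategy described in the abstract, each training step would draw a batch from either the ASR or the SER corpus and set the task id accordingly, so no single utterance needs both a transcript and an emotion label.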
Similar Papers
FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge
Sound
Cleans up messy sounds to make voices clear.
MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
Computation and Language
Teaches computers to understand emotions in voices.
Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion
Sound
Helps computers understand feelings in voices better.