Closing the Modality Reasoning Gap for Speech Large Language Models
By: Chaoren Wang, Heng Lu, Xueyao Zhang, and more
Potential Business Impact:
Makes computers understand spoken words as well as written ones.
Although speech large language models (Speech LLMs) have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text inputs. This gap could be associated with representational drift across Transformer layers and behavioral deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense, complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach substantially narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
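To make the two dense reward signals concrete, here is a minimal Python sketch (not the authors' implementation) of how such rewards could be computed, assuming the model exposes per-layer hidden states for both a speech-conditioned and a text-conditioned forward pass; all function names, the cosine-similarity choice, the sentence-embedding proxy for semantic consistency, and the weighting scheme are illustrative assumptions.

```python
# Hypothetical sketch of TARS-style dense rewards, assuming length-aligned
# hidden states and precomputed sentence embeddings are available.
import torch
import torch.nn.functional as F


def representation_alignment_reward(speech_hidden, text_hidden):
    """Layer-wise hidden-state similarity between speech- and text-conditioned
    trajectories, averaged over layers and positions.

    speech_hidden, text_hidden: lists of tensors, one per Transformer layer,
    each of shape (seq_len, hidden_dim), assumed already length-aligned.
    """
    layer_sims = []
    for h_speech, h_text in zip(speech_hidden, text_hidden):
        # Cosine similarity at each position, then mean over the sequence.
        layer_sims.append(F.cosine_similarity(h_speech, h_text, dim=-1).mean())
    return torch.stack(layer_sims).mean()


def behavior_alignment_reward(generated_emb, reference_emb):
    """Semantic consistency between the generated output and the reference
    text completion, approximated here by cosine similarity of sentence
    embeddings (the embedding model itself is an assumption)."""
    return F.cosine_similarity(generated_emb, reference_emb, dim=-1)


def tars_reward(speech_hidden, text_hidden, generated_emb, reference_emb,
                alpha=0.5):
    """Combine both signals into a scalar reward for the speech-conditioned
    branch; the exact asymmetric scheme and weighting are assumptions."""
    r_rep = representation_alignment_reward(speech_hidden, text_hidden)
    r_beh = behavior_alignment_reward(generated_emb, reference_emb)
    return alpha * r_rep + (1.0 - alpha) * r_beh
```

In this sketch the reward is dense in the sense that every layer and every position contributes, rather than a single end-of-sequence correctness signal, which matches the abstract's description of the signals as dense and complementary.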
Similar Papers
Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Computation and Language
Helps computers understand spoken words better.
SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
Computation and Language
Tests if computers understand spoken words like humans do.
Closing the Gap Between Text and Speech Understanding in LLMs
Computation and Language
Makes computers understand spoken words better.