Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning
By: Shaoshi Ling, Gang Liu, Guoli Ye, and more
Potential Business Impact:
Makes computers better at summarizing spoken words.
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.
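The paper does not spell out its training recipe here, but the idea of refining a speech-summarization model with reinforcement learning after supervised training can be illustrated with a small sketch. The snippet below is an illustrative assumption, not the authors' framework: the toy policy, vocabulary, ROUGE-1-style reward, and names such as ToyPolicy, rouge1_reward, and sample_summary are all hypothetical stand-ins for an MLLM decoder and a real summary-quality reward.

```python
# Minimal REINFORCE-style sketch of RL fine-tuning for summarization.
# Illustrative only: a real system would condition a multi-modal LLM on
# encoded audio and use a far richer reward than unigram overlap.
import torch
import torch.nn as nn

VOCAB = ["the", "meeting", "covered", "budget", "plans", "<eos>"]
V = len(VOCAB)

class ToyPolicy(nn.Module):
    """Stand-in for an MLLM decoder: maps a fixed 'speech' embedding to
    next-token logits."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, V)

    def forward(self, speech_emb):
        return self.proj(speech_emb)  # (batch, V) logits

def rouge1_reward(hyp, ref):
    """Unigram-overlap reward, a crude proxy for summary quality."""
    if not hyp:
        return 0.0
    return len(set(hyp) & set(ref)) / len(set(ref))

def sample_summary(policy, speech_emb, max_len=5):
    """Sample tokens and accumulate their log-probabilities."""
    tokens, logps = [], []
    for _ in range(max_len):
        dist = torch.distributions.Categorical(logits=policy(speech_emb))
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        word = VOCAB[tok.item()]
        if word == "<eos>":
            break
        tokens.append(word)
    return tokens, torch.stack(logps).sum()

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
speech_emb = torch.randn(1, 16)          # placeholder for encoded audio
reference = ["meeting", "covered", "budget"]

# RL stage of a hypothetical multi-stage pipeline: after supervised
# fine-tuning (omitted), maximize expected reward with REINFORCE and a
# moving-average baseline to reduce gradient variance.
baseline = 0.0
for step in range(200):
    hyp, logp = sample_summary(policy, speech_emb)
    reward = rouge1_reward(hyp, reference)
    baseline = 0.9 * baseline + 0.1 * reward
    loss = -(reward - baseline) * logp   # policy-gradient surrogate loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design point this sketch tries to convey is that the reward is computed on the generated summary itself, so the model can be optimized directly for summary quality rather than only for token-level likelihood of reference transcripts.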
Similar Papers
Explore the Reinforcement Learning for the LLM based ASR and TTS system
Sound
Makes talking computers understand and speak better.
Video Summarization with Large Language Models
CV and Pattern Recognition
Helps computers create video summaries that capture the story.
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Artificial Intelligence
Teaches AI to understand pictures and words together.