Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Published: August 22, 2025 | arXiv ID: 2508.16188v2

By: Weiting Tan, Jiachen Lian, Hirofumi Inaguma and more

Potential Business Impact:

Makes computer voices sound more real.

Business Areas:
Virtual World Community and Lifestyle, Media and Entertainment, Software

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.


Page Count
18 pages

Category
Computer Science: Computation and Language