Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Published: August 22, 2025 | arXiv ID: 2508.16188v2

By: Weiting Tan, Jiachen Lian, Hirofumi Inaguma and more

Potential Business Impact:

Makes computer voices sound more real.

Business Areas:

Virtual World Community and Lifestyle, Media and Entertainment, Software

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.

Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Computation and Language

Makes computers talk with real-life facial expressions.

22 Aug 2025

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues

Multimedia

AI understands feelings better from voices and faces.

8 Oct 2025

Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model

CV and Pattern Recognition

Reads emotions from faces in 3D.

28 Apr 2025

Page Count

18 pages
