Score: 3

VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task

Published: November 27, 2025 | arXiv ID: 2511.22229v1

By: Yuyue Wang, Xin Cheng, Yihan Wu and more

Potential Business Impact:

Generates dubbed speech for videos that matches the on-screen lip movements.

Business Areas:
Speech Recognition Data and Analytics, Software

The task of Visual Text-to-Speech (VisualTTS), also known as video dubbing, aims to generate speech that is synchronized with the lip movements in an input video, in addition to being consistent with the content of the input text and cloning the timbre of a reference speech. Existing VisualTTS models typically adopt lightweight architectures and design specialized modules to achieve each of these goals, yet the resulting speech quality is unsatisfactory due to limited model capacity and the scarcity of VisualTTS data. Recently, speech large language models (SpeechLLMs) have shown a robust ability to generate high-quality speech, but little work has been done to leverage temporal cues from video input when generating lip-synchronized speech. To generate speech that is both high-quality and lip-synchronized in VisualTTS tasks, we propose a novel Visual Speech Language Model, VSpeechLM, built upon a SpeechLLM. To capture the synchronization relationship between text and video, we propose a text-video aligner: it first learns a fine-grained alignment between phonemes and lip movements, then outputs an expanded phoneme sequence containing lip-synchronization cues. The proposed SpeechLLM-based decoders take this expanded phoneme sequence as input and learn to generate lip-synchronized speech. Extensive experiments demonstrate that VSpeechLM significantly outperforms previous VisualTTS methods in terms of overall quality, speaker similarity, and synchronization metrics.
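
Reading only the abstract, the pipeline can be understood as: align phonemes to lip-movement frames, expand the phoneme sequence according to the aligned durations, then feed that expanded sequence to the SpeechLLM-based decoders. The sketch below illustrates that data flow under those assumptions; every name (align_phonemes_to_lip_frames, expand_phoneme_sequence, synthesize_speech) is a hypothetical placeholder rather than the authors' code, and the uniform duration split merely stands in for the learned text-video aligner.

```python
# Minimal, hypothetical data-flow sketch of the VSpeechLM pipeline as
# described in the abstract. The real model uses learned modules (a
# text-video aligner and SpeechLLM-based decoders); nothing here is
# taken from the paper's implementation.

from typing import List, Sequence


def align_phonemes_to_lip_frames(phonemes: Sequence[str],
                                 num_video_frames: int) -> List[int]:
    """Toy stand-in for the text-video aligner: assign each phoneme a
    duration in video frames. The paper learns this fine-grained
    phoneme/lip-movement alignment; here frames are spread uniformly
    just to illustrate the interface."""
    base = num_video_frames // len(phonemes)
    durations = [base] * len(phonemes)
    durations[-1] += num_video_frames - base * len(phonemes)  # remainder
    return durations


def expand_phoneme_sequence(phonemes: Sequence[str],
                            durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme according to its aligned duration, producing
    the 'expanded phoneme sequence' that carries lip-sync cues."""
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]


def synthesize_speech(expanded_phonemes: Sequence[str],
                      reference_speech: bytes) -> bytes:
    """Placeholder for the SpeechLLM-based decoders, which would generate
    speech conditioned on the expanded phonemes and a reference timbre."""
    raise NotImplementedError("stands in for the learned SpeechLLM decoders")


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]  # "hello"
    durations = align_phonemes_to_lip_frames(phonemes, num_video_frames=10)
    expanded = expand_phoneme_sequence(phonemes, durations)
    print(expanded)
    # ['HH', 'HH', 'AH', 'AH', 'L', 'L', 'OW', 'OW', 'OW', 'OW']
```

The point of the sketch is the interface: the aligner's output is a per-phoneme duration derived from the video, and the expansion step is what injects lip-synchronization timing into the sequence the decoder consumes.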

Country of Origin
πŸ‡¨πŸ‡³ πŸ‡ΊπŸ‡Έ China, United States

Repos / Data Links

Page Count
8 pages

Category
Computer Science:
Multimedia