Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

Published: July 25, 2025 | arXiv ID: 2507.19225v1

By: Fang Kang, Yin Cao, Haoyu Chen

Potential Business Impact:

Makes faces talk with any voice.

Recent studies in speech-driven talking face generation achieve promising results, but their reliance on a fixed driving speech signal limits further applications and can lead to problems such as face-voice mismatch. We therefore extend the task to a more challenging setting: given a face image and text to speak, generate both the talking face animation and its corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several contributions: 1) Voice-Face Alignment, ensuring that generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of the generated voice over the paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge large pretrained visual and audio models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing diversity and identity consistency. Experiments show that Face2VoiceSync achieves state-of-the-art visual and audio performance on a single 40GB GPU.
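The abstract does not spell out the VAE's architecture, so the following is only a toy sketch of the general idea of a small VAE placed between two frozen models: a face embedding is mapped to a latent Gaussian, and different latent samples decode to different voice embeddings for the same face. All names, dimensions, and the single-layer linear encoder and decoder are assumptions made for illustration, not the authors' design; random weights stand in for trained ones.

```python
import math
import random

rng = random.Random(0)

# Hypothetical toy sizes; real face and voice embeddings would be far larger.
FACE_DIM, LATENT_DIM, VOICE_DIM = 8, 4, 6


def init(n_in, n_out):
    """One weight column per output unit; random values stand in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


W_mu = init(FACE_DIM, LATENT_DIM)
W_logvar = init(FACE_DIM, LATENT_DIM)
W_dec = init(LATENT_DIM, VOICE_DIM)


def linear(x, weights):
    """Bias-free linear layer: one dot product per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in weights]


def encode(face_emb):
    """Map a face embedding to the mean and log-variance of a latent Gaussian."""
    return linear(face_emb, W_mu), linear(face_emb, W_logvar)


def sample_latent(mu, logvar):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def decode(z):
    """Map a latent sample to a (hypothetical) voice embedding."""
    return linear(z, W_dec)


def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, sigma^2) and N(0, I), the usual VAE regulariser."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


# A random vector stands in for the output of a frozen face encoder.
face_emb = [rng.gauss(0.0, 1.0) for _ in range(FACE_DIM)]
mu, logvar = encode(face_emb)

# Three latent samples give three different voice embeddings for one face.
voices = [decode(sample_latent(mu, logvar)) for _ in range(3)]
kl = kl_to_standard_normal(mu, logvar)
```

In a setup like this only the three small weight matrices would be trained, which is one plausible reading of "significantly fewer trainable parameters"; sampling several latents per face is likewise one simple way to obtain voice diversity.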

Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation

CV and Pattern Recognition

Makes faces talk realistically from sound.

28 Jul 2025 1

90%

Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering

CV and Pattern Recognition

Makes any text speak with a realistic face.

4 Aug 2025 0

90%

AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective

CV and Pattern Recognition

Makes talking videos from one picture.

15 Sep 2025 0

Country of Origin

🇫🇮 Finland

Page Count

5 pages
