Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis
By: Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, and more
Potential Business Impact:
Makes computer-generated talking faces look and sound real.
We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle the distribution shift between clean and TTS-predicted features, we adopt a two-stage training strategy: pretraining on clean Wav2Vec2 embeddings and finetuning on TTS-predicted features. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
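The abstract only describes the framework at a high level, so the following PyTorch sketch is a loose illustration of the shared-latent idea rather than the authors' implementation: a Text-to-Vec module predicts Wav2Vec2-style latent features from text, and the same latent sequence conditions both a speech decoder and a face generator, with a flag emulating the two-stage schedule (pretrain on clean features, finetune on TTS-predicted ones). All module names, layer sizes, the `use_tts_features` switch, and the assumption that Text-to-Vec already outputs features at the audio frame rate are illustrative; the real system builds on HierSpeech++ components not shown here.

```python
# Hedged sketch (assumed interfaces, not the paper's code) of conditioning speech
# and face generation on a shared Wav2Vec2-style latent sequence.
import torch
import torch.nn as nn

LATENT_DIM = 768  # Wav2Vec2-base hidden size; an illustrative assumption


class TextToVec(nn.Module):
    """Maps token IDs to a sequence of Wav2Vec2-like latent vectors.
    For simplicity we assume the output is already at the audio frame rate;
    the real model would need a duration predictor / length regulator."""

    def __init__(self, vocab_size: int = 256, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, LATENT_DIM)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)              # (B, T, hidden)
        x, _ = self.encoder(x)              # (B, T, 2*hidden)
        return self.proj(x)                 # (B, T, LATENT_DIM)


class SpeechDecoder(nn.Module):
    """Stand-in for a HierSpeech++-style vocoder head: latents -> waveform frames."""

    def __init__(self, frame_size: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(), nn.Linear(512, frame_size)
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.net(latents)            # (B, T, frame_size)


class FaceGenerator(nn.Module):
    """Stand-in generator: identity embedding + audio latent -> per-frame face features."""

    def __init__(self, identity_dim: int = 256, out_dim: int = 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + identity_dim, 1024), nn.ReLU(), nn.Linear(1024, out_dim)
        )

    def forward(self, latents: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        identity = identity.unsqueeze(1).expand(-1, latents.size(1), -1)
        return self.net(torch.cat([latents, identity], dim=-1))


def training_step(batch, text2vec, speech_dec, face_gen, use_tts_features: bool):
    """One step of the assumed two-stage schedule.
    Stage 1 (use_tts_features=False): condition on clean Wav2Vec2 features
    extracted from ground-truth audio. Stage 2 (True): condition on
    Text-to-Vec predictions so the generators see the shifted distribution."""
    if use_tts_features:
        latents = text2vec(batch["tokens"])
    else:
        latents = batch["wav2vec2_features"]   # precomputed from real audio

    speech = speech_dec(latents)
    faces = face_gen(latents, batch["identity"])
    loss = nn.functional.l1_loss(speech, batch["target_speech"]) \
         + nn.functional.l1_loss(faces, batch["target_faces"])
    return loss


# Toy usage with random tensors (shapes chosen only so the sketch runs):
B, T = 2, 50
batch = {
    "tokens": torch.randint(0, 256, (B, T)),
    "wav2vec2_features": torch.randn(B, T, LATENT_DIM),
    "identity": torch.randn(B, 256),
    "target_speech": torch.randn(B, T, 320),
    "target_faces": torch.randn(B, T, 64 * 64),
}
loss = training_step(batch, TextToVec(), SpeechDecoder(), FaceGenerator(), use_tts_features=True)
```

The point of the sketch is the design choice the abstract highlights: because both decoders consume the same latent sequence, audio and lip motion stay aligned by construction, and finetuning on the TTS-predicted latents closes the train/inference gap, since no ground-truth audio (and hence no clean Wav2Vec2 features) is available at inference time.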
Similar Papers
Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering
CV and Pattern Recognition
Makes any text speak with a realistic face.
Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation
Sound
Makes faces talk with any voice.
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
CV and Pattern Recognition
Makes videos match sounds and words perfectly.