Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

Published: November 7, 2025 | arXiv ID: 2511.05432v1

By: Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, and more

Potential Business Impact:

Makes digital avatars talk with realistic, lip-synced faces and natural speech generated directly from text.

Business Areas:
Speech Recognition Data and Analytics, Software

We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition both speech and face generation. To handle the distribution shift between clean and TTS-predicted features, we adopt a two-stage training scheme: pretraining on clean Wav2Vec2 embeddings and fine-tuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech with synchronized facial motion, without requiring ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
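
To make the shared-conditioning idea concrete, below is a minimal PyTorch sketch. The module names (TextToVec, FaceGenerator), dimensions, and MSE loss are illustrative assumptions, not the authors' implementation, which builds on HierSpeech++.

```python
# Sketch of the two-stage conditioning idea: a text-to-vec module predicts
# Wav2Vec2-like latent frames, and a face generator is conditioned on them.
# All names and sizes here are hypothetical placeholders.
import torch
import torch.nn as nn

W2V_DIM = 1024  # assumed Wav2Vec2 feature dimension (illustrative)

class TextToVec(nn.Module):
    """Maps text/phoneme tokens to Wav2Vec2-like latent frames."""
    def __init__(self, vocab_size=256, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.proj = nn.Linear(d_model, W2V_DIM)

    def forward(self, tokens):                # (B, T)
        h = self.encoder(self.embed(tokens))  # (B, T, d_model)
        return self.proj(h)                   # (B, T, W2V_DIM)

class FaceGenerator(nn.Module):
    """Predicts facial-motion frames from the shared latent features."""
    def __init__(self, motion_dim=128):
        super().__init__()
        self.rnn = nn.GRU(W2V_DIM, 256, batch_first=True)
        self.head = nn.Linear(256, motion_dim)

    def forward(self, feats):                 # (B, T, W2V_DIM)
        h, _ = self.rnn(feats)
        return self.head(h)                   # (B, T, motion_dim)

def training_step(face_gen, feats, target_motion):
    """Stage 1: `feats` are clean Wav2Vec2 embeddings from real audio.
    Stage 2: `feats` are TTS-predicted embeddings from TextToVec, so the
    generator adapts to the shifted feature distribution."""
    pred = face_gen(feats)
    return nn.functional.mse_loss(pred, target_motion)

# Inference sketch: text alone drives both streams, no ground-truth audio.
tokens = torch.randint(0, 256, (1, 32))
feats = TextToVec()(tokens)          # TTS-predicted latent features
motion = FaceGenerator()(feats)      # synchronized facial motion
```

The key design point the sketch illustrates: because the same latent features condition both streams, fine-tuning the generator on TextToVec outputs (stage two) closes the clean-vs-predicted feature gap, which is what allows inference to run from text alone.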

Repos / Data Links

Page Count
5 pages

Category
Computer Science:
Computer Vision and Pattern Recognition