SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Published: November 23, 2025 | arXiv ID: 2512.05126v1

By: Kaidi Wang, Yi He, Wenhao Guan, and more

Potential Business Impact:

Makes videos speak in other languages, with dubbed speech closely synchronized to the visuals.

Business Areas:
Speech Recognition Data and Analytics, Software

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still fall short in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We further propose a Dual Speaker Encoder that mitigates inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential for video dubbing tasks.
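The summary names two ingredients, fine-tuning a pretrained TTS model on audio-visual data and a Dual Speaker Encoder for cross-lingual synthesis, without giving architectural detail. The sketch below is a minimal, hypothetical PyTorch illustration of what a dual-branch speaker encoder could look like: one branch embeds reference speech in the source language, another embeds reference speech in the target language, and a learned gate blends the two embeddings so that speaker timbre is kept while language-specific cues are down-weighted. The module names, layer sizes, and gating scheme here are assumptions for illustration only, not the paper's design.

import torch
import torch.nn as nn


class DualSpeakerEncoder(nn.Module):
    """Hypothetical dual-branch speaker encoder (illustrative, not from the paper).

    Each branch maps a mel-spectrogram of reference speech to an
    utterance-level embedding; a learned sigmoid gate blends the
    source-language and target-language embeddings into one speaker
    conditioning vector for the TTS decoder.
    """

    def __init__(self, mel_dim: int = 80, embed_dim: int = 256):
        super().__init__()

        def branch() -> nn.Module:
            return nn.Sequential(
                nn.Conv1d(mel_dim, embed_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool over time -> utterance-level vector
                nn.Flatten(),             # (batch, embed_dim, 1) -> (batch, embed_dim)
            )

        self.source_branch = branch()  # reference speech in the original language
        self.target_branch = branch()  # reference speech in the dubbing language
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, src_mel: torch.Tensor, tgt_mel: torch.Tensor) -> torch.Tensor:
        # src_mel, tgt_mel: (batch, mel_dim, frames); frame counts may differ
        e_src = self.source_branch(src_mel)   # (batch, embed_dim)
        e_tgt = self.target_branch(tgt_mel)   # (batch, embed_dim)
        g = self.gate(torch.cat([e_src, e_tgt], dim=-1))
        return g * e_src + (1.0 - g) * e_tgt  # blended speaker embedding


if __name__ == "__main__":
    enc = DualSpeakerEncoder()
    src = torch.randn(2, 80, 120)  # 2 utterances, 80 mel bins, 120 frames
    tgt = torch.randn(2, 80, 140)  # different length is fine; each branch pools independently
    print(enc(src, tgt).shape)     # torch.Size([2, 256])

The blended embedding would then condition the pretrained TTS model alongside the text and visual features; how SyncVoice actually fuses these signals is not specified in this summary.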

Country of Origin
🇨🇳 China

Page Count
5 pages

Category
Electrical Engineering and Systems Science: Audio and Speech Processing