SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
By: Kaidi Wang, Yi He, Wenhao Guan, and more
Potential Business Impact:
Makes videos speak in any language, closely synced to the picture.
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still fall short in speech naturalness and audio-visual synchronization, and are typically restricted to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We further introduce a Dual Speaker Encoder to mitigate inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing to video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential for video dubbing tasks.
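The abstract does not describe the Dual Speaker Encoder's internals, so the following is a minimal, hypothetical PyTorch sketch of one plausible reading: two independent reference-encoder branches, one per language group, selected by a language id so that speaker conditioning for one language does not interfere with the other. All names, dimensions, and the gating scheme here (the `DualSpeakerEncoder` internals, `mel_dim`, `embed_dim`, `lang_id`) are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DualSpeakerEncoder(nn.Module):
    """Hypothetical sketch: one reference-encoder branch per language
    group, selected per utterance by a language id, so cross-lingual
    speaker conditioning stays separated between languages."""

    def __init__(self, mel_dim: int = 80, embed_dim: int = 256):
        super().__init__()
        # Two independent branches (assumed design, not from the paper text).
        self.source_branch = nn.Sequential(
            nn.Conv1d(mel_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time frames
        )
        self.target_branch = nn.Sequential(
            nn.Conv1d(mel_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, ref_mel: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, mel_dim, frames); lang_id: (batch,), 0 = source, 1 = target
        src = self.source_branch(ref_mel).squeeze(-1)  # (batch, embed_dim)
        tgt = self.target_branch(ref_mel).squeeze(-1)  # (batch, embed_dim)
        gate = lang_id.float().unsqueeze(-1)           # (batch, 1)
        # Hard selection between branches per utterance.
        return (1.0 - gate) * src + gate * tgt
```

In a full system along the lines the abstract suggests, the resulting speaker embedding would condition the pretrained TTS model alongside visual features (e.g., lip motion) to drive audio-visual synchronization.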
Similar Papers
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
Multimedia
Makes videos talk and have background sounds.
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
Multimedia
Makes videos talk with matching lip movements.
Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization
Sound
Makes dubbed videos match the original speaking time.