DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
By: Wenjie Tian, Xinfa Zhu, Haohe Liu, and more
Potential Business Impact:
Makes videos talk and have background sounds.
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. In addition, to handle data scarcity, we design a curriculum learning strategy that progressively builds multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation, with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
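To make the architectural description more concrete, below is a minimal PyTorch sketch of what a DualDub-style model could look like: video features pass through a cross-modal aligner that combines causal self-attention over the generated token stream with non-causal cross-attention into the full visual sequence, and the aligned representation feeds two parallel heads, one for background-audio tokens and one for speech tokens. All class names, layer sizes, vocabularies, and the use of a simple projection in place of the paper's multimodal encoder and language-model backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a DualDub-style V2ST model. Names and dimensions are
# assumptions for illustration; the paper's actual encoder, tokenizers, and
# multimodal language model are not reproduced here.
import torch
import torch.nn as nn


class CrossModalAligner(nn.Module):
    """Causal self-attention keeps generation autoregressive (synchronization);
    non-causal cross-attention lets every step see the full visual context
    (acoustic harmony), loosely mirroring the abstract's description."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.causal_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.noncausal_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, audio_tokens: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        T = audio_tokens.size(1)
        # Boolean mask: True = "may not attend", i.e. block future positions.
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                            device=audio_tokens.device), diagonal=1)
        x, _ = self.causal_self_attn(audio_tokens, audio_tokens, audio_tokens,
                                     attn_mask=causal_mask)
        x = self.norm1(audio_tokens + x)
        # Non-causal cross-attention over the entire visual feature sequence.
        y, _ = self.noncausal_cross_attn(x, video_feats, video_feats)
        x = self.norm2(x + y)
        return self.norm3(x + self.ffn(x))


class DualDubSketch(nn.Module):
    """Video features + previously generated tokens in; parallel logits for
    background-audio tokens and speech tokens out (the dual decoding heads)."""

    def __init__(self, d_model: int = 512, video_dim: int = 1024,
                 audio_vocab: int = 1024, speech_vocab: int = 1024):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)       # stand-in for the multimodal encoder
        self.token_embed = nn.Embedding(audio_vocab + speech_vocab, d_model)
        self.aligner = CrossModalAligner(d_model)
        self.audio_head = nn.Linear(d_model, audio_vocab)     # background-audio head
        self.speech_head = nn.Linear(d_model, speech_vocab)   # speech head

    def forward(self, video_feats: torch.Tensor, prev_tokens: torch.Tensor):
        v = self.video_proj(video_feats)
        h = self.aligner(self.token_embed(prev_tokens), v)
        return self.audio_head(h), self.speech_head(h)


if __name__ == "__main__":
    model = DualDubSketch()
    video = torch.randn(2, 50, 1024)                # 2 clips, 50 visual frames
    prev = torch.randint(0, 1024, (2, 120))         # previously generated tokens
    audio_logits, speech_logits = model(video, prev)
    print(audio_logits.shape, speech_logits.shape)  # (2, 120, 1024) each
```

The key design point the sketch tries to capture is that both decoding heads share one aligned representation, so background audio and speech are generated from the same synchronized, video-conditioned context rather than by two independent models.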
Similar Papers
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Audio and Speech Processing
Makes videos speak in any language, perfectly synced.
Training-Free Multimodal Guidance for Video to Audio Generation
Machine Learning (CS)
Makes silent videos talk with realistic sounds.
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
CV and Pattern Recognition
Makes videos speak with matching faces.