Score: 2

DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis

Published: July 14, 2025 | arXiv ID: 2507.10109v1

By: Wenjie Tian, Xinfa Zhu, Haohe Liu, and others

Potential Business Impact:

Generates complete video soundtracks automatically, producing both synchronized speech and background audio from visual input.

Business Areas:
Speech Recognition Data and Analytics, Software

While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. In addition, to handle data scarcity, we design a curriculum learning strategy that progressively builds the model's multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation, with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality, well-synchronized soundtracks containing both speech and background audio.
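
The abstract outlines the architecture at a high level: a multimodal encoder feeds a cross-modal aligner that combines causal attention (for step-by-step synchronization) with non-causal attention (for global acoustic harmony), and two decoding heads emit background-audio and speech tokens in parallel. As a rough illustration only, the PyTorch sketch below shows one plausible wiring; every module name, dimension, and vocabulary size here is an assumption made for illustration, not taken from the paper.

```python
# Hypothetical sketch of a DualDub-style model: all names, dimensions,
# and vocabulary sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Fuses the token stream with video features.

    Per the abstract, alignment combines a causal path (autoregressive
    synchronization) with a non-causal path (global acoustic harmony);
    the exact wiring below is an assumption.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.causal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: each step attends only to past and current positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        causal, _ = self.causal_attn(x, x, x, attn_mask=mask)
        # Non-causal cross-attention over the full video sequence.
        global_ctx, _ = self.global_attn(x, video, video)
        return self.norm(x + causal + global_ctx)

class DualDubSketch(nn.Module):
    """Shared backbone with two decoding heads: one predicting background
    audio tokens, one predicting speech tokens (vocab sizes assumed)."""
    def __init__(self, dim: int = 512, audio_vocab: int = 1024,
                 speech_vocab: int = 1024, video_feat_dim: int = 768):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, dim)  # e.g. frame features
        self.aligner = CrossModalAligner(dim)
        self.audio_head = nn.Linear(dim, audio_vocab)
        self.speech_head = nn.Linear(dim, speech_vocab)

    def forward(self, tokens: torch.Tensor, video_feats: torch.Tensor):
        video = self.video_proj(video_feats)
        h = self.aligner(tokens, video)
        # Dual heads decode both streams from the same aligned states.
        return self.audio_head(h), self.speech_head(h)

# Toy usage: one clip, 50 decoding steps, 30 video frames.
model = DualDubSketch()
audio_logits, speech_logits = model(torch.randn(1, 50, 512),
                                    torch.randn(1, 30, 768))
print(audio_logits.shape, speech_logits.shape)  # both (1, 50, 1024)
```

The dual-head design reflects the paper's claim of simultaneous generation: both streams are decoded from the same aligned hidden states, so background audio and speech share timing information by construction.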

Country of Origin
🇨🇳 🇬🇧 China, United Kingdom

Page Count
12 pages

Category
Computer Science:
Multimedia