SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
By: Amir Dellali, Luca A. Lanzendörfer, Florian Grötschla, and more
Potential Business Impact:
Makes silent videos talk with realistic sound.
We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity, long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in audiovisual alignment and synchronization with video content, both in quantitative evaluation and in a human listening study. Furthermore, our use of random masking during training enables the model to match the spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
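To make the masked-diffusion and shortcut ideas concrete, the sketch below shows one plausible training step and a few-step sampler in PyTorch. This is not the authors' released code: the model signature `model(x_t, t, d, cond)`, the rectified-flow interpolation path, the 50% masking rate, and the shortcut self-consistency target (one step of size 2d matching two chained steps of size d, in the style of shortcut models) are all illustrative assumptions; the paper's exact objective may differ.

```python
# Hypothetical sketch, not the SALSA-V implementation. Assumes a velocity-
# prediction backbone `model(x_t, t, d, cond)` conditioned on diffusion time t,
# step size d, and video features. All names here are illustrative.
import torch

def masked_shortcut_loss(model, x0, video_cond):
    """One training step combining a masked flow-matching objective
    with a shortcut (self-consistency) loss.

    x0:         clean audio latents, shape (B, T, C)
    video_cond: video features the model is conditioned on
    """
    B, T, _ = x0.shape
    noise = torch.randn_like(x0)
    t = torch.rand(B, 1, 1, device=x0.device)       # diffusion time in [0, 1]

    # Linear interpolation path (rectified flow); target velocity is x0 - noise.
    x_t = (1.0 - t) * noise + t * x0
    v_target = x0 - noise

    # Random time mask: positions with mask=1 are generated, mask=0 are kept
    # as clean reference audio (enables audio-conditioned generation).
    mask = (torch.rand(B, T, 1, device=x0.device) < 0.5).float()
    x_in = mask * x_t + (1.0 - mask) * x0

    # Flow-matching loss, computed only on the masked (generated) region;
    # step size d=0 denotes a plain flow-matching query.
    d0 = torch.zeros(B, 1, 1, device=x0.device)
    v_pred = model(x_in, t, d0, video_cond)
    fm_loss = (mask * (v_pred - v_target) ** 2).mean()

    # Shortcut self-consistency: one step of size 2d should match the average
    # of two chained steps of size d (targets detached from the graph).
    d = torch.full((B, 1, 1), 1.0 / 8, device=x0.device)
    with torch.no_grad():
        v1 = model(x_in, t, d, video_cond)
        x_mid = x_in + d * v1 * mask                # advance masked region only
        t_mid = torch.clamp(t + d, max=1.0)
        v2 = model(x_mid, t_mid, d, video_cond)
        v_sc_target = 0.5 * (v1 + v2)
    v_big = model(x_in, t, 2.0 * d, video_cond)
    sc_loss = (mask * (v_big - v_sc_target) ** 2).mean()

    return fm_loss + sc_loss

def sample(model, video_cond, shape, steps=8, device="cpu"):
    """Few-step sampling: with a trained shortcut model, take `steps`
    uniform steps of size 1/steps (the abstract's eight-step regime)."""
    x = torch.randn(shape, device=device)
    d = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1), i * d, device=device)
        x = x + d * model(x, t, torch.full_like(t, d), video_cond)
    return x
```

Under these assumptions, long-form generation would presumably chain fixed-length windows by masking each new window while feeding the tail of the previous one as unmasked reference audio, which is consistent with the seamless unconstrained-length synthesis the abstract describes.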
Similar Papers
Training-Free Multimodal Guidance for Video to Audio Generation
Machine Learning (CS)
Makes silent videos talk with realistic sounds.
Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm
CV and Pattern Recognition
Makes videos match sounds automatically and easily.
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
CV and Pattern Recognition
Makes videos play only the sound you want.