Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
By: Christian Marinoni, Riccardo Fosco Gramaccioni, Eleonora Grassucci, and more
Potential Business Impact:
Makes videos sound and look right from any angle.
The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation, addressing this unexplored gap. Specifically, we propose a diffusion model conditioned on a set of powerful signals derived from the full 360-degree space: a panoramic saliency map that identifies regions of interest, a bounding-box-aware signed distance map that defines the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, providing the strong controllability essential for realistic and immersive audio-visual generation. We present audio-visual examples demonstrating the effectiveness of our framework.
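The paper does not spell out how the conditioning signals are computed, but the bounding-box-aware signed distance map lends itself to a simple illustration. The sketch below is a minimal, assumption-laden example of one way such a map could be built on an equirectangular panorama grid: pixels inside the viewpoint bounding box get negative distances, pixels outside get positive distances. The function name, the sign convention, pixel units, and the neglect of longitude wrap-around are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def bbox_signed_distance_map(height, width, box):
    """Hypothetical viewpoint-conditioning channel: signed distance (in pixels)
    from each panorama pixel to a bounding box given as (top, left, bottom, right).
    Negative inside the box, positive outside. Ignores the horizontal wrap-around
    of equirectangular panoramas for simplicity (an assumption, not the paper's method).
    """
    top, left, bottom, right = box
    ys = np.arange(height)[:, None]   # (H, 1) row coordinates
    xs = np.arange(width)[None, :]    # (1, W) column coordinates

    # Per-axis distance to the box interval (0 when the coordinate is inside it).
    dy = np.maximum.reduce([top - ys, ys - bottom, np.zeros_like(ys)])
    dx = np.maximum.reduce([left - xs, xs - right, np.zeros_like(xs)])
    outside = np.hypot(dy, dx)        # Euclidean distance for pixels outside the box

    # For pixels inside the box: negative distance to the nearest box edge.
    inside = -np.minimum(np.minimum(ys - top, bottom - ys),
                         np.minimum(xs - left, right - xs))
    inside = np.where((dy == 0) & (dx == 0), inside, 0)

    return np.where(outside > 0, outside, inside).astype(np.float32)

# Example: a 512x1024 panorama with a viewpoint box, stacked with a (placeholder)
# saliency map into a two-channel spatial conditioning tensor for the diffusion model.
sdm = bbox_signed_distance_map(512, 1024, box=(128, 300, 384, 600))
saliency = np.zeros((512, 1024), dtype=np.float32)        # stand-in for a real saliency map
spatial_condition = np.stack([saliency, sdm], axis=0)      # shape (2, 512, 1024)
```

In practice the scene caption would be encoded separately (e.g. with a text encoder) and supplied to the diffusion model alongside this spatial conditioning tensor; how the three signals are fused is left to the paper.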
Similar Papers
SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation
Graphics
Turns sounds into videos matching noise locations.
Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
CV and Pattern Recognition
Makes VR videos show where people look.
FoleySpace: Vision-Aligned Binaural Spatial Audio Generation
Sound
Makes videos sound like you're really there.