Score: 1

Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information

Published: October 7, 2025 | arXiv ID: 2510.06060v1

By: Christian Marinoni, Riccardo Fosco Gramaccioni, Eleonora Grassucci, and more

Potential Business Impact:

Makes generated videos look and sound correct from any chosen viewpoint within a 360° scene.

Business Areas:
Augmented Reality Hardware, Software

The generation of sounding videos has advanced significantly with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments, which restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation that addresses this unexplored gap. Specifically, we propose a diffusion model conditioned on a set of powerful signals derived from the full 360-degree space: a panoramic saliency map that identifies regions of interest, a bounding-box-aware signed distance map that defines the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, providing the strong controllability essential for realistic and immersive audio-visual generation. We present audiovisual examples demonstrating the effectiveness of our framework.
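The abstract does not specify how the conditioning signals are computed. As a minimal sketch, the snippet below shows one plausible way to build a bounding-box-aware signed distance map over an equirectangular panorama grid (negative inside the target-viewpoint box, positive outside). All function names, parameters, and the normalization choice are illustrative assumptions, not the authors' implementation, and the longitude wrap-around of a true spherical distance is ignored for simplicity.

```python
import numpy as np

def signed_distance_map(height, width, bbox):
    """Signed distance (in pixels) from each pixel to the boundary of a
    viewpoint bounding box on an equirectangular grid.

    bbox: (x_min, y_min, x_max, y_max) in pixel coordinates.
    Values are negative inside the box, positive outside, zero on the edge.
    Note: this simplification does not handle longitude wrap-around.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    x_min, y_min, x_max, y_max = bbox

    # Per-axis distance to the box interval; zero when inside that interval.
    dx = np.maximum(np.maximum(x_min - xs, xs - x_max), 0)
    dy = np.maximum(np.maximum(y_min - ys, ys - y_max), 0)
    outside = np.sqrt(dx**2 + dy**2)

    # Inside the box: distance to the nearest edge, taken as negative.
    inside = np.minimum(
        np.minimum(xs - x_min, x_max - xs),
        np.minimum(ys - y_min, y_max - ys),
    )
    inside = np.where((dx == 0) & (dy == 0), -inside, 0)

    sdm = outside + inside
    # Normalize to roughly [-1, 1] so the map can be stacked with other
    # conditioning channels (e.g., a panoramic saliency map).
    return sdm / max(height, width)

# Example: a 512x1024 panorama with a viewpoint box around the front view.
sdm = signed_distance_map(512, 1024, bbox=(384, 128, 640, 384))
```

Encoding the viewpoint as a smooth signed distance field, rather than a hard binary mask, gives the diffusion model a graded notion of how far each panoramic region lies from the rendered view, which is one way the off-camera context described in the abstract could influence generation.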

Page Count
5 pages

Category
Computer Science:
Multimedia