Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video
By: Daniel Adebi, Sagnik Majumder, Kristen Grauman
Potential Business Impact:
Lets cameras know where they are using sound.
Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.
Similar Papers
DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes
Sound
Helps microphones hear sounds from any direction.
PAVAS: Physics-Aware Video-to-Audio Synthesis
CV and Pattern Recognition
Makes videos sound real by understanding physics.
Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
Multimedia
Makes videos sound and look right from any angle.