Listen to the Unexpected: Self-Supervised Surprise Detection for Efficient Viewport Prediction
By: Arman Nik Khah, Ravi Prakash
Potential Business Impact:
Makes watching 360° videos smoother by predicting where you'll look.
Adaptive streaming of 360-degree video relies on viewport prediction to allocate bandwidth efficiently. Current approaches predominantly use visual saliency or historical gaze patterns, neglecting the role of spatial audio in guiding user attention. This paper presents a self-supervised framework for detecting "surprising" auditory events -- moments that deviate from learned temporal expectations -- and demonstrates their utility for viewport prediction. The proposed architecture combines $SE(3)$-equivariant graph neural networks with recurrent temporal modeling, trained via a dual self-supervised objective. The framework naturally models temporal attention decay: surprise is high at event onset but diminishes as the listener adapts. Experiments on the AVTrack360 dataset show that integrating audio surprise with visual cues reduces bitrate waste by up to 18% compared to visual-only methods.
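The temporal attention decay described in the abstract can be illustrated with a minimal sketch (not the paper's implementation): assume the self-supervised temporal model emits a per-frame prediction error for the spatial audio stream; surprise spikes when that error exceeds a threshold at event onset and decays exponentially as the listener adapts. The function names, threshold, decay rate, and fusion weight below are hypothetical.

```python
import numpy as np

def decayed_surprise(pred_error, onset_threshold=1.0, decay_rate=0.5):
    """Turn per-frame temporal-prediction errors into a decaying surprise score.

    A frame whose error exceeds onset_threshold marks an event onset; the
    surprise then decays exponentially over subsequent frames, mimicking the
    listener adapting to the new sound. All parameter values are illustrative.
    """
    surprise = np.zeros(len(pred_error))
    current = 0.0
    for t, err in enumerate(pred_error):
        if err > onset_threshold:
            current = err                     # new surprising onset: reset score
        else:
            current *= np.exp(-decay_rate)    # attention decays as listener adapts
        surprise[t] = current
    return surprise

def fuse_for_viewport(visual_saliency, audio_surprise, alpha=0.3):
    """Blend visual saliency with audio surprise into a single attention weight."""
    return (1 - alpha) * visual_saliency + alpha * audio_surprise

# Example: a single error spike at frame 2 produces a surprise peak that then
# decays geometrically, matching the qualitative behavior the abstract describes.
errors = np.array([0.1, 0.2, 2.5, 0.3, 0.2, 0.1])
print(decayed_surprise(errors))
```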
Similar Papers
Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
Multimedia
Makes videos sound and look right from any angle.
Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
CV and Pattern Recognition
Makes VR videos show where people look.
Investigating self-supervised representations for audio-visual deepfake detection
CV and Pattern Recognition
Finds fake videos by listening and watching.