Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
By: Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, and more
Potential Business Impact:
Lets computers understand mixed sounds and sights.
We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or, at best, handling both sequentially rather than as a mixture. This limitation prevents them from capturing the complexity of real-world audio, where sources are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement from mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, on which our model outperforms prior methods. Our approach also achieves comparable or better performance on standard segmentation and cross-modal retrieval tasks, highlighting the benefits of the mix-and-separate approach.
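To make the mix-and-separate idea concrete, the sketch below shows one plausible form of such a training objective: two audio embeddings (speech and non-speech) are additively mixed, a separator module recovers type-specific embeddings from the mixture, and each recovered embedding is aligned with the visual embedding of its source. The module names, the additive mixing, and the InfoNCE-plus-reconstruction loss are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of a "mix-and-separate" audio-visual alignment objective.
# Separator, info_nce, and mix_and_separate_loss are assumed names; the loss
# form is an assumption based on the abstract's description, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Separator(nn.Module):
    """Maps a mixed-audio embedding to two type-specific embeddings
    (speech, non-speech) so each can be grounded independently."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.speech_head = nn.Linear(dim, dim)
        self.sound_head = nn.Linear(dim, dim)

    def forward(self, mixed_emb: torch.Tensor):
        return self.speech_head(mixed_emb), self.sound_head(mixed_emb)


def info_nce(query: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07):
    """Standard InfoNCE: matching (query_i, key_i) pairs in the batch are positives."""
    logits = F.normalize(query, dim=-1) @ F.normalize(keys, dim=-1).T / temperature
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


def mix_and_separate_loss(speech_emb, sound_emb, img_speech_emb, img_sound_emb,
                          separator: Separator):
    """Mix speech and non-speech audio embeddings, separate them back, and
    align each separated embedding with its corresponding visual embedding
    (correspondence + disentanglement, in the spirit of the abstract)."""
    mixed = speech_emb + sound_emb            # simple additive mixing (assumed)
    sep_speech, sep_sound = separator(mixed)  # disentangle the mixture

    # Correspondence: each separated audio embedding should match the visual
    # embedding of its own source.
    loss = info_nce(sep_speech, img_speech_emb) + info_nce(sep_sound, img_sound_emb)

    # Disentanglement: the separated embeddings should also recover the
    # original (unmixed) audio embeddings.
    loss = loss + F.mse_loss(sep_speech, speech_emb) + F.mse_loss(sep_sound, sound_emb)
    return loss


if __name__ == "__main__":
    # Toy usage with random embeddings in place of real encoder outputs.
    B, D = 8, 512
    separator = Separator(D)
    loss = mix_and_separate_loss(torch.randn(B, D), torch.randn(B, D),
                                 torch.randn(B, D), torch.randn(B, D), separator)
    loss.backward()
    print(f"toy loss: {loss.item():.3f}")
```

In this sketch the separator is trained to undo the mixing while the contrastive terms tie each separated embedding to the correct visual content, which is one way to obtain distinct, groundable embeddings per audio type from mixed input.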
Similar Papers
SeeingSounds: Learning Audio-to-Visual Alignment via Text (Sound): Makes pictures from sounds without seeing them.
Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator (CV and Pattern Recognition): Makes pictures from many sounds at once.
Learning to Highlight Audio by Watching Movies (CV and Pattern Recognition): Makes videos sound better by watching them.