Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting
By: Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via the Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first approach to unify both forms of supervision in a single framework.
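The sketch below illustrates one plausible form of such a joint objective: a CLAP-style symmetric InfoNCE loss that aligns paired audio and text embeddings, combined with a supervised audio-audio contrastive term that pulls together embeddings of the same spoken word. This is not the authors' implementation; the function names, temperature, and weighting factor `alpha` are illustrative assumptions, and the audio-audio term is a generic stand-in for the paper's DWD loss.

```python
# Minimal sketch (assumed, not the paper's exact formulation) of a joint
# multimodal contrastive objective over a shared embedding space.
import torch
import torch.nn.functional as F


def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """CLAP-style symmetric InfoNCE over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)


def audio_audio_contrastive_loss(audio_emb, word_ids, temperature=0.07):
    """Audio-audio term: same-word embeddings are pulled together, different
    words pushed apart (supervised contrastive form, standing in for DWD)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    B = audio_emb.size(0)
    sim = audio_emb @ audio_emb.t() / temperature
    self_mask = torch.eye(B, dtype=torch.bool, device=audio_emb.device)
    pos_mask = (word_ids.unsqueeze(0) == word_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                   # skip anchors with no positive
    loss = -log_prob.sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()


def joint_loss(audio_emb, text_emb, word_ids, alpha=0.5):
    """Weighted sum of the cross-modal and audio-audio objectives."""
    return alpha * clap_style_loss(audio_emb, text_emb) + \
        (1.0 - alpha) * audio_audio_contrastive_loss(audio_emb, word_ids)


if __name__ == "__main__":
    # Toy usage: 8 spoken-word segments covering 4 distinct words.
    B, D = 8, 256
    audio_emb = torch.randn(B, D, requires_grad=True)
    text_emb = torch.randn(B, D, requires_grad=True)
    word_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    loss = joint_loss(audio_emb, text_emb, word_ids)
    loss.backward()
    print(float(loss))
```

One caveat worth noting: with word-level batches, several audio items may share the same text label, so the identity-matrix targets in the CLAP-style term introduce false negatives; a full implementation would likely deduplicate text labels or mask same-word off-diagonal pairs.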