GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval
By: Bowen Yang, Yun Cao, Chen He and more
Potential Business Impact:
Finds video clips that match text descriptions by using both what is seen and what is heard.
Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly addresses this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other -- fusion reduces modality gaps while perturbation regularizes cross-modal matching -- yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at https://github.com/YangBowenn/GAID.
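The abstract does not spell out the fusion mechanics, but a minimal sketch of what text-guided, frame-level gating could look like is shown below. All shapes, layer names, and the sigmoid-gate design are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class FrameLevelGatedFusion(nn.Module):
    """Illustrative sketch of frame-level gated audio-visual fusion.

    For each frame, a gate conditioned on the text query decides how much of
    the audio feature to blend into the visual feature. Hypothetical design,
    not GAID's actual FGF module.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Gate reads the concatenated [visual, audio, text] features per frame.
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor,
                text: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, num_frames, dim); text: (batch, dim)
        text_expanded = text.unsqueeze(1).expand_as(visual)
        gate = self.gate(torch.cat([visual, audio, text_expanded], dim=-1))
        # Per-frame interpolation between visual-only and audio-enriched features.
        return gate * audio + (1.0 - gate) * visual


if __name__ == "__main__":
    fusion = FrameLevelGatedFusion(dim=512)
    v = torch.randn(2, 12, 512)   # visual features for 12 frames
    a = torch.randn(2, 12, 512)   # aligned audio features
    t = torch.randn(2, 512)       # pooled text-query embedding
    print(fusion(v, a, t).shape)  # torch.Size([2, 12, 512])
```

The key idea this sketch captures is that the audio contribution is decided per frame and per query, rather than by a single video-level fusion weight.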
Similar Papers
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
CV and Pattern Recognition
Finds videos using sound and words better.
Training-Free Multimodal Guidance for Video to Audio Generation
Machine Learning (CS)
Makes silent videos talk with realistic sounds.
DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion
Sound
Separates sounds from videos better.