Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Published: April 3, 2025 | arXiv ID: 2504.02397v1

By: Boseung Jeong, Jicheol Park, Sungyeon Kim and more

Potential Business Impact:

Finds videos more accurately by using both sound and words.

Business Areas:

Guides Media and Entertainment

Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
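The gating idea in the abstract can be illustrated with a small sketch. This is not the AVIGATE implementation: the abstract does not give the architecture, so the scalar sigmoid gate, the single-query attention, and every name below (`attend`, `gated_fusion`, `gate_weights`, `gate_bias`) are assumptions made for illustration. The sketch only shows how a gate near 0 lets a frame feature pass through unchanged when the audio is uninformative, while a gate near 1 mixes the attended audio context in.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one query over keys (keys double as values)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def gated_fusion(frame, audio_tokens, gate_weights, gate_bias):
    """Hypothetical gate: add audio context to a frame feature, scaled by a value in (0, 1)."""
    audio_context = attend(frame, audio_tokens)
    # `frame + audio_context` is list concatenation, so gate_weights has length 2 * dim.
    gate = sigmoid(dot(gate_weights, frame + audio_context) + gate_bias)
    fused = [f + gate * a for f, a in zip(frame, audio_context)]
    return fused, gate


frame = [1.0, 0.0]
audio_tokens = [[0.5, 0.5], [0.0, 1.0]]

# A strongly negative bias closes the gate: the fused feature stays close to the frame.
closed, g_closed = gated_fusion(frame, audio_tokens, [0.0] * 4, -20.0)

# A strongly positive bias opens the gate: the audio context is added almost in full.
opened, g_opened = gated_fusion(frame, audio_tokens, [0.0] * 4, 20.0)
```

In a trained model the gate parameters would be learned rather than set by hand, so the gate value would depend on the content of each clip's audio and video features.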

GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval

CV and Pattern Recognition

Finds video clips from spoken words.

3 Aug 2025 2

89%

TA-V2A: Textually Assisted Video-to-Audio Generation

CV and Pattern Recognition

Makes videos talk with matching sounds.

12 Mar 2025 0

88%

Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective

CV and Pattern Recognition

Answers questions about video text faster.

6 Aug 2025 2

Country of Origin

🇰🇷 Korea, Republic of

Page Count

15 pages

