
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos

Published: July 16, 2025 | arXiv ID: 2507.11967v1

By: Yuchi Ishikawa, Shota Nakada, Hokuto Munakata and more

Potential Business Impact:

Teaches computers to understand sound, sight, and words.

Business Areas:

Guides, Media and Entertainment

In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders, enabling the model to learn across audio, visual and text modalities. To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos. We first generate frame-level captions using an image captioning model and then apply CLAP-based filtering to ensure strong alignment between audio and captions. This approach yields high-quality audio-visual-text triplets without requiring manual annotations. We evaluate LG-CAV-MAE on audio-visual retrieval tasks, as well as an audio-visual classification task. Our method significantly outperforms existing approaches, achieving up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task.
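The CLAP-based filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the audio and caption embeddings have already been computed in a shared space (for example by CLAP's audio and text encoders) and that a clip is kept when its best-matching frame caption reaches a cosine-similarity threshold. The function names, the dictionary layout, the "best caption per clip" rule, and the threshold value are all hypothetical choices made for illustration; the abstract does not specify the authors' exact filtering criterion.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_triplets(clips: List[Dict], threshold: float) -> List[Tuple[str, str, float]]:
    """Keep one (video_id, caption, score) entry per clip whose best caption
    aligns with the audio embedding at or above `threshold`.

    Assumed (hypothetical) clip layout:
      {"video_id": str,
       "audio_emb": list of floats,            # e.g. from a CLAP audio encoder
       "captions": [(text, text_emb), ...]}    # frame-level captions + text embeddings
    """
    kept = []
    for clip in clips:
        # Score every frame-level caption against the clip's audio embedding.
        scored = [
            (cosine_similarity(clip["audio_emb"], text_emb), text)
            for text, text_emb in clip["captions"]
        ]
        if not scored:
            continue  # no captions were generated for this clip
        best_score, best_text = max(scored)
        if best_score >= threshold:
            kept.append((clip["video_id"], best_text, best_score))
    return kept


# Toy 2-dimensional embeddings, invented purely for illustration.
clips = [
    {
        "video_id": "v1",
        "audio_emb": [1.0, 0.0],
        "captions": [("a dog barking", [0.9, 0.1]), ("a blue sky", [0.0, 1.0])],
    },
    {
        "video_id": "v2",
        "audio_emb": [0.0, 1.0],
        "captions": [("a red car", [1.0, 0.0])],
    },
]
kept = filter_triplets(clips, threshold=0.5)
```

In this toy run only clip "v1" survives, paired with the caption closest to its audio embedding; "v2" is dropped because its only caption is orthogonal to the audio. In a real pipeline the threshold would need to be tuned against the actual embedding model.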

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Multimedia

Lets computers understand sounds and pictures together.

2 May 2025 2

91%

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

CV and Pattern Recognition

Teaches computers to understand videos better.

8 Feb 2025 1

91%

LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

CV and Pattern Recognition

Helps computers understand long videos better.

4 Apr 2025 1


Page Count

5 pages
