CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Published: May 2, 2025 | arXiv ID: 2505.01237v2

By: Edson Araujo, Andrew Rouditchenko, Yuan Gong and more

Potential Business Impact:

Lets computers understand sounds and pictures together.

Business Areas:

Motion Capture, Media and Entertainment, Video

Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges. First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification, and localization tasks, demonstrating state-of-the-art performance and outperforming more complex architectures.
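
The first idea in the abstract, pairing each video frame with its temporally aligned audio segment instead of one global audio embedding, can be illustrated with a small contrastive-loss sketch. This is not the authors' implementation: the abstract does not state the exact loss, so a symmetric InfoNCE objective is assumed here, and the function name `info_nce`, the temperature value, and the embedding shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def info_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss (assumed objective, not taken from the paper).

    audio_emb, video_emb: arrays of shape (N, D). Row i of audio_emb is the
    embedding of the audio segment aligned with the video frame in row i
    of video_emb; all other rows act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; entry (i, j) compares segment i to frame j.
    logits = (a @ v.T) / temperature

    def diag_cross_entropy(z):
        # Numerically stable log-softmax over each row, then pick the
        # matching (diagonal) pair as the positive.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

Under these assumptions, embeddings of correctly aligned segment and frame pairs should yield a lower loss than the same embeddings with the audio rows shifted out of alignment, which is the property a fine-grained alignment objective is meant to reward.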

SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation

Artificial Intelligence

Makes videos of people talking match sound.

11 Oct 2025 1

89%

Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

CV and Pattern Recognition

Helps computers understand people talking and acting.

24 Aug 2025 2

89%

LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

CV and Pattern Recognition

Helps computers understand long videos better.

4 Apr 2025 1

Country of Origin

🇩🇪 Germany

Repos / Data Links

github.com

Page Count

14 pages
