Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
By: Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, and more
We introduce Perception Encoder Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV extends representations to audio and natively supports joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data spans speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We employ ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
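The training signal described above is pairwise contrastive alignment across modality and caption-type pairs. Below is a minimal sketch of how such a multi-pair objective can be assembled from symmetric InfoNCE terms; the specific pair list, batch shapes, and `temperature` value are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: symmetric InfoNCE summed over several modality/caption pairs.
# The four pairs below are representative stand-ins for the ten pairwise
# objectives reported in the abstract.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two batches of aligned embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical clip-level embeddings for one batch (B clips, D dims), standing
# in for the outputs of the audio, video, and text encoders.
B, D = 8, 512
audio  = torch.randn(B, D, requires_grad=True)
video  = torch.randn(B, D, requires_grad=True)
a_text = torch.randn(B, D, requires_grad=True)   # caption of the audio track
v_text = torch.randn(B, D, requires_grad=True)   # caption of the visual track

# One InfoNCE term per cross-modality / caption-type pair, averaged.
pairs = [(audio, video), (audio, a_text), (video, v_text), (audio, v_text)]
loss = sum(info_nce(x, y) for x, y in pairs) / len(pairs)
loss.backward()
```

A frame-level variant in the spirit of PE-A-Frame would apply the same kind of loss between per-frame audio embeddings and text embeddings rather than between clip-level pairs, which is what enables localized tasks like sound event detection.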