Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning
By: Ludovic Tuncay, Etienne Labbé, Emmanouil Benetos, and more
Potential Business Impact:
Teaches computers to understand sounds with less data.
Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in a high-level feature space, we propose Audio-JEPA, an adaptation of JEPA tailored to audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict the latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on mel-spectrograms and evaluate on the X-ARES suite, which covers speech, music, and environmental-sound tasks. Although our implementation is a straightforward translation of the original model to audio, it achieves performance comparable to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and no hyper-parameter tuning. All code and pre-trained checkpoints will be released on GitHub.
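To make the objective concrete, here is a minimal PyTorch sketch of one JEPA-style training step on mel-spectrogram patches: a context encoder sees only the visible patches, an EMA target encoder produces the latent targets, and a predictor regresses those targets at the masked positions. Everything here (patch size, masking ratio, encoder depth, EMA rate, spectrogram dimensions) is an illustrative assumption, not the paper's actual configuration.

```python
# Minimal sketch of a JEPA-style masked-latent prediction step on mel-spectrogram
# patches. All sizes and hyper-parameters are illustrative assumptions.
import copy
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (n_mels, n_frames) mel-spectrogram into non-overlapping patches and project them."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, spec):                      # spec: (B, 1, n_mels, n_frames)
        x = self.proj(spec)                       # (B, dim, H', W')
        return x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)

def make_encoder(dim=256, depth=4, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

dim, n_mels, n_frames = 256, 128, 992             # roughly a 10 s clip; sizes are assumptions
patcher   = PatchEmbed(dim=dim)
context_e = make_encoder(dim)                     # trained by gradient descent
target_e  = copy.deepcopy(context_e)              # EMA copy, never updated by gradients
for p in target_e.parameters():
    p.requires_grad_(False)
predictor = make_encoder(dim, depth=2)
mask_tok  = nn.Parameter(torch.zeros(1, 1, dim))
num_patches = (n_mels // 16) * (n_frames // 16)
pos_emb   = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)

spec = torch.randn(8, 1, n_mels, n_frames)        # stand-in for a batch of mel-spectrograms
tokens = patcher(spec) + pos_emb                  # (B, N, dim)
B, N, _ = tokens.shape

# Randomly hide ~75% of the patches from the context encoder (ratio is an assumption).
mask = torch.rand(B, N) < 0.75                    # True = masked

# Target encoder sees the full spectrogram; its latents are the regression targets.
with torch.no_grad():
    targets = target_e(tokens)                    # (B, N, dim)

# Context encoder sees only visible patches; masked slots are replaced by a learned
# token so the sequence keeps its length and positions (a simplification of the masking scheme).
visible = torch.where(mask.unsqueeze(-1), mask_tok.expand(B, N, dim), tokens)
context = context_e(visible)

# Predictor regresses the target latents at the masked positions only.
pred = predictor(context)
loss = nn.functional.mse_loss(pred[mask], targets[mask])
loss.backward()
# (An optimizer step on patcher / context_e / predictor / mask_tok / pos_emb would go here.)

# The target encoder then tracks the context encoder as an exponential moving average.
with torch.no_grad():
    for pt, pc in zip(target_e.parameters(), context_e.parameters()):
        pt.mul_(0.996).add_(pc.detach(), alpha=0.004)
```

The key design choice this sketch reflects is that the loss is computed in latent space against a stop-gradient EMA target, rather than by reconstructing the raw audio or spectrogram pixels.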
Similar Papers
WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms
Sound
Makes computers understand sounds faster and better.
JEPA for RL: Investigating Joint-Embedding Predictive Architectures for Reinforcement Learning
Computer Vision and Pattern Recognition
Teaches robots to learn from watching.
SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures
Machine Learning (CS)
Makes AI understand pictures better and more clearly.