WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms

Published: September 27, 2025 | arXiv ID: 2509.23238v1

By: Goksenin Yuksel, Pierre Guetschel, Michael Tangermann and more

Potential Business Impact:

Makes computers understand sounds faster and better.

Business Areas:

Semantic Web, Internet Services

Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not achieved similar feats for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat. WavJEPA-Nat is a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential for low-latency, robust time-domain audio foundation models for real-world applications.

JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

Sound

Makes computers understand speech better, faster.

8 Dec 2025 0

89%

SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures

Machine Learning (CS)

Makes AI understand pictures better and more clearly.

22 Apr 2025 0

89%

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Machine Learning (CS)

Teaches AI to learn from the world better.

11 Nov 2025 1

Country of Origin

🇳🇱 Netherlands

Page Count

18 pages
