JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
By: Georgios Ioannides, Christos Constantinou, Aman Chadha, and more
Potential Business Impact:
Makes computers understand speech better and faster.
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
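To make the Stage 2 tokenization concrete, here is a minimal Python sketch of FSQ followed by mixed-radix packing. The per-dimension level counts in LEVELS are hypothetical placeholders, not the paper's configuration, and the bounding and rounding details are illustrative assumptions rather than the authors' implementation; the sketch only shows why the packed tokens are reversible.

import numpy as np

# Hypothetical per-dimension quantization levels (not the paper's config).
LEVELS = [8, 5, 5, 5]

def fsq_quantize(z):
    """Map each latent dimension to an integer code in [0, L_i)."""
    z = np.tanh(z)  # bound activations to (-1, 1), one common FSQ choice
    codes = []
    for zi, L in zip(z, LEVELS):
        # Rescale (-1, 1) to [0, L-1] and round to the nearest level.
        codes.append(int(round((zi + 1) / 2 * (L - 1))))
    return codes

def pack_mixed_radix(codes):
    """Pack per-dimension codes into one integer token.

    Mixed-radix: dimension i contributes codes[i] * prod(LEVELS[i+1:]),
    so the mapping from code tuples to tokens is a bijection."""
    token = 0
    for c, L in zip(codes, LEVELS):
        token = token * L + c
    return token

def unpack_mixed_radix(token):
    """Invert pack_mixed_radix, recovering the per-dimension codes."""
    codes = []
    for L in reversed(LEVELS):
        codes.append(token % L)
        token //= L
    return list(reversed(codes))

z = np.array([0.3, -1.2, 0.9, 0.0])  # one latent frame (toy values)
codes = fsq_quantize(z)
token = pack_mixed_radix(codes)
assert unpack_mixed_radix(token) == codes  # packing is lossless

As a consistency check on the abstract's numbers: 47.5 tokens per second at a 2.5 Hz frame rate works out to 19 tokens per frame.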
Similar Papers
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
Machine Learning (CS)
Helps computers understand how likely things are.
SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures
Machine Learning (CS)
Makes AI understand pictures better and more clearly.
WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms
Sound
Makes computers understand sounds faster and better.