
SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

Published: December 23, 2025 | arXiv ID: 2512.20308v1

By: Maxime Poli, Mahi Luthra, Youssef Benchekroun and more

Potential Business Impact:

Teaches computers to understand talking without words.

Business Areas:

Semantic Web, Internet Services

The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.
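The training recipe described in the abstract (a student predicting cluster assignments derived from a teacher, with the codebook learned online) can be illustrated with a toy sketch. This is not the released SpidR code. The abstract does not state how the teacher is updated, which distance is used for assignment, or any decay values, so the exponential-moving-average (EMA) teacher, the squared Euclidean distance, the EMA codebook update, and all function names below are assumptions chosen because they are common in self-distillation setups.

```python
import math


def ema_update(old, new, decay):
    """Blend `old` toward `new`; used here for both teacher weights and codewords (assumed rule)."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]


def nearest_code(frame, codebook):
    """Index of the codeword closest to `frame` (squared Euclidean distance, an assumption)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))


def update_codebook(codebook, frames, assignments, decay):
    """Online clustering step: move each codeword toward the mean of its assigned frames."""
    updated = []
    for k, code in enumerate(codebook):
        members = [f for f, a in zip(frames, assignments) if a == k]
        if not members:
            updated.append(list(code))  # unused codewords stay where they are
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated.append(ema_update(code, mean, decay))
    return updated


def cross_entropy(logits, target):
    """Student loss for one masked frame: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


# Toy data: four "teacher layer" frames and a two-entry codebook (hypothetical values).
teacher_frames = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1], [1.1, 0.9]]
codebook = [[0.0, 0.0], [1.0, 1.0]]

# 1. The teacher's frames are assigned to their nearest codewords; these are the targets.
targets = [nearest_code(f, codebook) for f in teacher_frames]  # [0, 1, 0, 1]

# 2. The student's logits over codewords at a masked position are scored against the target.
student_logits = [2.0, -1.0]
loss = cross_entropy(student_logits, targets[0])

# 3. The codebook drifts toward the teacher frames, and the teacher drifts toward the student.
codebook = update_codebook(codebook, teacher_frames, targets, decay=0.9)
teacher_weights = ema_update([0.5, 0.5], [1.0, 0.0], decay=0.99)
```

In this sketch the targets come only from the teacher side, so the student never clusters its own outputs. In the method described above, this prediction is applied at intermediate layers of both networks rather than at a single output layer.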

WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning

Audio and Speech Processing

Helps computers understand emotions in spoken words.

15 Jan 2025 2

87%

DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

Sound

Helps computers tell who is speaking.

20 Oct 2025 1

87%

Entropy-based Coarse and Compressed Semantic Speech Representation Learning

Computation and Language

Makes computers understand talking with fewer details.

30 Aug 2025 0


Repos / Data Links

github.com

Page Count

30 pages
