SLAP: Learning Speaker and Health-Related Representations from Natural Language Supervision

Published: October 2, 2025 | arXiv ID: 2510.01860v1

By: Angelika Ando, Auguste Crabeil, Adrien Lesage, and more

Potential Business Impact:

Enables software to infer speaker traits and health indicators from voice recordings.

Business Areas:
Speech Recognition Data and Analytics, Software

Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first model aligning speech with natural language descriptions of speaker and health metadata through contrastive learning. SLAP combines a Vision Transformer audio encoder with text encoders, trained on more than 3400 hours across 9 datasets with diverse speaker annotations. We evaluate SLAP on 38 binary classification tasks spanning demographics, voice characteristics, and clinical assessments across 14 datasets in 7 languages. SLAP achieves 62.9% average F1 in zero-shot evaluation, a 48% relative improvement over CLAP (42.4%), while demonstrating strong OOD generalization to unseen languages and clinical populations. When fine-tuned with linear probing, SLAP reaches 69.3% F1 overall and achieves best-in-class performance on health tasks (57.9% F1), surpassing larger foundation models.
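The zero-shot evaluation described above follows the usual contrastive language-audio recipe: embed the audio clip and a text prompt for each candidate label into a shared space, then pick the label whose prompt embedding is most similar to the audio embedding. A minimal sketch of that decision rule, using toy fixed vectors in place of real SLAP encoder outputs (the embeddings, dimensionality, and prompt wording here are illustrative assumptions, not the paper's actual values):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(audio_emb, prompt_embs):
    # Return the label whose text-prompt embedding is closest
    # (by cosine similarity) to the audio embedding.
    scores = {label: cosine(audio_emb, emb) for label, emb in prompt_embs.items()}
    return max(scores, key=scores.get)

# Toy 3-d embeddings standing in for real audio/text encoder outputs.
prompts = {
    "a recording of a male speaker": [0.9, 0.1, 0.2],
    "a recording of a female speaker": [0.1, 0.9, 0.3],
}
audio = [0.85, 0.2, 0.25]
print(zero_shot_classify(audio, prompts))  # → "a recording of a male speaker"
```

In a real pipeline the two dictionaries of vectors would come from the trained audio and text encoders; the binary tasks in the paper then reduce to comparing two such prompt scores per clip.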

Page Count
5 pages

Category
Electrical Engineering and Systems Science:
Audio and Speech Processing