SLAP: Learning Speaker and Health-Related Representations from Natural Language Supervision
By: Angelika Ando, Auguste Crabeil, Adrien Lesage, and more
Potential Business Impact:
Lets computers understand health from voices.
Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first model aligning speech with natural language descriptions of speaker and health metadata through contrastive learning. SLAP combines a Vision Transformer audio encoder with text encoders, trained on more than 3,400 hours across 9 datasets with diverse speaker annotations. We evaluate SLAP on 38 binary classification tasks spanning demographics, voice characteristics, and clinical assessments across 14 datasets in 7 languages. SLAP achieves 62.9% average F1 in zero-shot evaluation, a 48% relative improvement over CLAP (42.4%), while demonstrating strong OOD generalization to unseen languages and clinical populations. With linear probing, SLAP reaches 69.3% F1 overall and achieves best-in-class performance on health tasks (57.9% F1), surpassing larger foundation models.
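Since the abstract describes CLAP-style contrastive alignment and zero-shot prompting only at a high level, a minimal sketch may help. The loss below is the standard symmetric InfoNCE objective used in contrastive language-audio pretraining; the function names, temperature, embedding dimension, and prompt wording are illustrative assumptions, not SLAP's released code.

```python
# Minimal sketch of CLAP-style contrastive training and zero-shot scoring.
# Encoders are stood in by random tensors; dimensions and the temperature
# value are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def zero_shot_predict(audio_emb, class_text_embs):
    """Score audio against one text prompt per class; pick the most similar."""
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(class_text_embs, dim=-1).t()
    return sims.argmax(dim=-1)

if __name__ == "__main__":
    B, D = 8, 512                 # assumed batch size and embedding dimension
    audio = torch.randn(B, D)     # stand-in for ViT audio-encoder outputs
    text = torch.randn(B, D)      # stand-in for text-encoder outputs of metadata descriptions
    print("loss:", contrastive_loss(audio, text).item())
    prompts = torch.randn(2, D)   # e.g. embeddings of "a healthy voice" vs. "a pathological voice"
    print("predictions:", zero_shot_predict(audio, prompts))
```

The diagonal targets are what make this contrastive: each audio clip is pulled toward its own metadata description and pushed away from every other description in the batch, which is what later allows a binary task to be posed as a choice between two text prompts with no task-specific training.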
Similar Papers
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Audio and Speech Processing
Teaches computers to understand all sounds.
GLAP: General contrastive audio-text pretraining across domains and languages
Sound
Lets computers understand sounds in many languages.
Spatial-CLAP: Learning Spatially-Aware Audio-Text Embeddings for Multi-Source Conditions
Sound
Helps computers know where sounds come from.