SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation
By: Yizhou Zhang, Yuan Gao, Wangjin Zhou, and more
Potential Business Impact:
Learns new sounds without forgetting old ones.
Self-supervised learning (SSL) on large-scale datasets like AudioSet has become the dominant paradigm for audio representation learning. The continuous influx of new, unlabeled audio presents an opportunity to enrich these static representations, but the naive approach of retraining the model from scratch on all available data is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR adapts effectively to new domains while mitigating catastrophic forgetting by tackling three key challenges: implementing a joint sampling strategy over new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook to capture novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.
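To make the first two components more concrete, below is a minimal PyTorch sketch of how a joint-sampling replay loop and a self-distillation regularizer against a frozen teacher could look. The 70/30 mixing ratio, the L2 distillation loss, the toy linear encoder standing in for BEATs, and the function names (joint_batches, continual_step) are all illustrative assumptions; the abstract does not specify these details, and the dynamic codebook expansion is omitted.

```python
# Hypothetical sketch of two ingredients described in the abstract:
# (1) joint sampling that mixes new-domain batches with replayed prior-domain
#     batches, and (2) a self-distillation regularizer that keeps the adapted
#     student close to a frozen copy of the previously pre-trained model.
# The mixing ratio, loss form, toy encoder, and all names are illustrative
# assumptions, not the paper's exact formulation.
import copy
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def joint_batches(new_loader, prior_loader, new_ratio=0.7):
    """Yield (batch, domain) pairs, drawing from the new domain with
    probability `new_ratio` and otherwise replaying prior-domain data."""
    new_it, prior_it = iter(new_loader), iter(prior_loader)
    while True:
        try:
            if torch.rand(1).item() < new_ratio:
                yield next(new_it), "new"
            else:
                yield next(prior_it), "prior"
        except StopIteration:
            return


def continual_step(student, teacher, batch, optimizer, lam=1.0):
    """One continual pre-training step: an adaptation loss on the batch plus
    an L2 self-distillation term against the frozen teacher."""
    (x,) = batch  # toy features of shape (batch, dim)
    student_out = student(x)
    with torch.no_grad():
        teacher_out = teacher(x)
    # Placeholder adaptation objective; the real BEATs objective predicts
    # discrete acoustic tokens, which is omitted here for brevity.
    adapt_loss = F.mse_loss(student_out, x)
    distill_loss = F.mse_loss(student_out, teacher_out)
    loss = adapt_loss + lam * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a linear layer stands in for the BEATs backbone.
student = torch.nn.Linear(64, 64)
teacher = copy.deepcopy(student).eval()
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

new_data = DataLoader(TensorDataset(torch.randn(256, 64)), batch_size=32)
prior_data = DataLoader(TensorDataset(torch.randn(256, 64)), batch_size=32)
for batch, domain in joint_batches(new_data, prior_data):
    continual_step(student, teacher, batch, optimizer)
```

In a real setup the frozen teacher would presumably be the previously pre-trained BEATs checkpoint and the adaptation loss its token-prediction objective; the sketch only illustrates how replay and self-distillation fit into one training loop.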
Similar Papers
Self-Improvement for Audio Large Language Model using Unlabeled Speech
Sound
Improves voice AI without needing new recordings.
SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection
Sound
Finds fake voices by listening to tiny sound details.
Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training
CV and Pattern Recognition
Teaches computers to understand new pictures better.