Wav2DF-TSL: Two-stage Learning with Efficient Pre-training and Hierarchical Experts Fusion for Robust Audio Deepfake Detection
By: Yunqi Hao , Yihao Chen , Minqiang Xu and more
Potential Business Impact:
Finds fake voices in recordings better.
In recent years, self-supervised learning (SSL) models have made significant progress in audio deepfake detection (ADD) tasks. However, existing SSL models mainly rely on large-scale real speech for pre-training and lack the learning of spoofed samples, which leads to susceptibility to domain bias during the fine-tuning process of the ADD task. To this end, we propose a two-stage learning strategy (Wav2DF-TSL) based on pre-training and hierarchical expert fusion for robust audio deepfake detection. In the pre-training stage, we use adapters to efficiently learn artifacts from 3000 hours of unlabelled spoofed speech, improving the adaptability of front-end features while mitigating catastrophic forgetting. In the fine-tuning stage, we propose the hierarchical adaptive mixture of experts (HA-MoE) method to dynamically fuse multi-level spoofing cues through multi-expert collaboration with gated routing. Experimental results show that the proposed method significantly outperforms the baseline system on all four benchmark datasets, especially on the cross-domain In-the-wild dataset, achieving a 27.5% relative improvement in equal error rate (EER), outperforming the existing state-of-the-art systems. Index Terms: audio deepfake detection, self-supervised learning, parameter-efficient fine-tuning, mixture of experts
Similar Papers
Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based Countermeasures
Sound
Finds fake voices better, even new ones.
KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features
Audio and Speech Processing
Finds fake videos by listening to the sound.
When Fine-Tuning is Not Enough: Lessons from HSAD on Hybrid and Adversarial Audio Spoof Detection
Sound
Catches fake voices mixed with real ones.