
Wav2DF-TSL: Two-stage Learning with Efficient Pre-training and Hierarchical Experts Fusion for Robust Audio Deepfake Detection

Published: September 4, 2025 | arXiv ID: 2509.04161v1

By: Yunqi Hao, Yihao Chen, Minqiang Xu and more

Potential Business Impact:

Finds fake voices in recordings better.

Business Areas:

Speech Recognition, Data and Analytics, Software

In recent years, self-supervised learning (SSL) models have made significant progress in audio deepfake detection (ADD). However, existing SSL models are pre-trained mainly on large-scale real speech and do not learn from spoofed samples, which leaves them susceptible to domain bias when fine-tuned for ADD. To address this, we propose Wav2DF-TSL, a two-stage learning strategy based on efficient pre-training and hierarchical expert fusion for robust audio deepfake detection. In the pre-training stage, we use adapters to efficiently learn artifacts from 3000 hours of unlabelled spoofed speech, improving the adaptability of front-end features while mitigating catastrophic forgetting. In the fine-tuning stage, we propose the hierarchical adaptive mixture of experts (HA-MoE) method, which dynamically fuses multi-level spoofing cues through multi-expert collaboration with gated routing. Experimental results show that the proposed method significantly outperforms the baseline system on all four benchmark datasets, most notably on the cross-domain In-the-wild dataset, where it achieves a 27.5% relative improvement in equal error rate (EER) and outperforms existing state-of-the-art systems.

Index Terms: audio deepfake detection, self-supervised learning, parameter-efficient fine-tuning, mixture of experts
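
The abstract reports its headline result as a relative improvement in equal error rate. As a reading aid, here is a minimal sketch of how those two quantities are commonly computed. It is not the authors' evaluation code: the function names, the "higher score means bona fide" convention, and all numbers are assumptions chosen for illustration, and none of them come from the paper.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumed convention (not from the paper): a higher score means the
    detector considers the utterance more likely to be bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed speech scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Fraction by which new_eer is lower than baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


# Made-up detector scores, for illustration only.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(bonafide, spoof))

# Hypothetical EER values (in percent), not figures reported in the paper.
print(relative_improvement(10.0, 7.25))
```

A relative improvement is measured against the baseline's own error rate, so a 27.5% relative improvement means the EER shrank by that fraction of the baseline value, not by 27.5 percentage points.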

Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based Countermeasures

Sound

Finds fake voices better, even new ones.

15 Sep 2025 0

89%

KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features

Audio and Speech Processing

Finds fake videos by listening to the sound.

10 Aug 2025 1

89%

When Fine-Tuning is Not Enough: Lessons from HSAD on Hybrid and Adversarial Audio Spoof Detection

Sound

Catches fake voices mixed with real ones.

9 Sep 2025 0


Page Count

8 pages
