XMUspeech Systems for the ASVspoof 5 Challenge
By: Wangjie Li, Xingjia Xie, Yishuang Li, and more
Potential Business Impact:
Finds fake voices in audio recordings.
In this paper, we present the XMUspeech systems submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has increased significantly, and we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM-Conformer, HuBERT, and wav2vec 2.0 with various input features and loss functions. Specifically, to obtain artifact-related information, we trained self-supervised models on a dataset containing spoofed utterances and used them as feature extractors. We then applied an adaptive multi-scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with hand-crafted features, enhancing detection capability. In addition, we conducted extensive experiments on one-class loss functions and provide optimized configurations that better align with the anti-spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
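The abstract does not detail the AMFF architecture, but the general idea of combining hidden states from several Transformer layers of a self-supervised front-end with a hand-crafted feature can be sketched as below. This is a minimal illustration assuming learnable per-layer weights and a linear projection; the class name LayerFusion and all dimensions are hypothetical and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Minimal sketch: fuse hidden states from multiple Transformer layers
    of a self-supervised front-end (e.g. wav2vec 2.0) with a hand-crafted
    feature. Learnable per-layer scalar weights are an assumption for
    illustration, not the paper's actual AMFF design."""

    def __init__(self, num_layers: int, ssl_dim: int,
                 handcrafted_dim: int, out_dim: int):
        super().__init__()
        # One learnable scalar weight per Transformer layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Project concatenated SSL + hand-crafted features to a common size.
        self.proj = nn.Linear(ssl_dim + handcrafted_dim, out_dim)

    def forward(self, layer_states: torch.Tensor,
                handcrafted: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, time, ssl_dim)
        # handcrafted:  (batch, time, handcrafted_dim), same time resolution assumed.
        w = torch.softmax(self.layer_weights, dim=0)         # normalize layer weights
        fused = (w.view(-1, 1, 1, 1) * layer_states).sum(0)  # weighted sum over layers
        return self.proj(torch.cat([fused, handcrafted], dim=-1))
```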
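For the one-class loss experiments, a widely used choice in anti-spoofing is OC-Softmax (Zhang et al., 2021), which scores embeddings by cosine similarity to a learned target direction and applies asymmetric margins to bona fide and spoofed inputs. Whether this is the exact loss the authors used, and what their optimized configuration is, is not stated in the abstract; the sketch below uses the margins and scale from the original OC-Softmax paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    """One-class softmax loss (OC-Softmax): bona fide embeddings are pulled
    toward a single target direction, spoofed embeddings pushed away.
    Margins m_real=0.9, m_fake=0.2 and scale alpha=20 follow the original
    paper, not necessarily the configuration tuned by the authors here."""

    def __init__(self, feat_dim: int, m_real: float = 0.9,
                 m_fake: float = 0.2, alpha: float = 20.0):
        super().__init__()
        self.center = nn.Parameter(torch.randn(1, feat_dim))  # target direction w
        self.m_real, self.m_fake, self.alpha = m_real, m_fake, alpha

    def forward(self, x: torch.Tensor, labels: torch.Tensor):
        # x: (batch, feat_dim) embeddings; labels: (batch,), 0 = bona fide, 1 = spoof.
        w = F.normalize(self.center, dim=1)
        x = F.normalize(x, dim=1)
        scores = (x @ w.t()).squeeze(1)  # cosine similarity to w
        # Bona fide: penalize scores below m_real; spoof: penalize scores above m_fake.
        margin = torch.where(labels == 0,
                             self.m_real - scores,
                             scores - self.m_fake)
        loss = F.softplus(self.alpha * margin).mean()  # softplus(z) = log(1 + e^z)
        return loss, scores  # scores double as detection scores at test time
```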
Similar Papers
Unmasking Deepfakes: Leveraging Augmentations and Features Variability for Deepfake Speech Detection
Sound
Finds fake voices in recordings better.
Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection
Sound
Stops fake voices from tricking voice security.