Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection
By: Ivan Viakhirev, Daniil Sirota, Aleksandr Smirnov, and more
Potential Business Impact:
Helps prevent synthetic voices from bypassing voice-based security systems.
Advances in voice conversion and text-to-speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti-spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self-supervised speech representations in limited-data settings, substitutes the original graph attention block with a standardized multi-head attention module using heterogeneous query projections, and replaces heuristic frame-segment fusion with a trainable, context-aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6% equal error rate (EER), improving on a re-implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
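The three refinements named in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see their repository for that): the encoder here is a stand-in `nn.Linear` rather than a real Wav2Vec 2.0 model, the layer sizes are arbitrary, and the "context-aware fusion" is approximated by a single learned query attending over all frames; the heterogeneous query projections of the paper are not reproduced.

```python
import torch
import torch.nn as nn

class SpoofDetectorSketch(nn.Module):
    """Illustrative sketch of the abstract's three ideas:
    (1) a frozen self-supervised encoder (stand-in for Wav2Vec 2.0),
    (2) a standardized multi-head attention block in place of
        AASIST's graph attention,
    (3) a trainable, context-aware fusion layer instead of
        heuristic frame-segment pooling.
    All names and dimensions are assumptions, not the paper's."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        # (1) stand-in encoder; in the paper this is Wav2Vec 2.0, frozen
        self.encoder = nn.Linear(1, dim)
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep SSL representations fixed
        # (2) standard multi-head self-attention over frame features
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # (3) trainable fusion: a learned query pools the sequence
        self.fusion_query = nn.Parameter(torch.randn(1, 1, dim))
        self.classifier = nn.Linear(dim, 2)  # bonafide vs. spoof

    def forward(self, wav):                       # wav: (batch, time)
        feats = self.encoder(wav.unsqueeze(-1))   # (batch, time, dim)
        feats, _ = self.attn(feats, feats, feats) # self-attention block
        query = self.fusion_query.expand(wav.size(0), -1, -1)
        fused, _ = self.attn(query, feats, feats) # context-aware pooling
        return self.classifier(fused.squeeze(1))  # (batch, 2) logits

model = SpoofDetectorSketch()
logits = model(torch.randn(2, 100))
print(logits.shape)  # torch.Size([2, 2])
```

Only the attention, fusion, and classifier parameters receive gradients; freezing the encoder mirrors the paper's motivation of preserving self-supervised features when training data is limited.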