EnvSSLAM-FFN: Lightweight Layer-Fused System for ESDD 2026 Challenge
By: Xiaoxuan Guo, Hengyan Huang, Jiayi Zhou and more
Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection under two conditions: unseen generators (Track 1) and black-box, low-resource detection (Track 2). We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight feed-forward network (FFN) back-end. To capture spoofing artifacts under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system outperforms the official baselines on both tracks, achieving test Equal Error Rates (EERs) of 1.20% on Track 1 and 1.05% on Track 2.
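To make the ingredients named in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The abstract does not say how layers 4-9 are fused or how the class weights are chosen, so the softmax-weighted sum, the inverse-frequency weights, the layer indexing, and all function and parameter names below are assumptions made for illustration. The EER routine follows the standard definition: the operating point where the false-alarm rate and the miss rate are equal.

```python
import math


def fuse_layers(hidden_states, logits, first=4, last=9):
    """Softmax-weighted sum of encoder layers first..last (inclusive).

    hidden_states[i] is the embedding (list of floats) from layer i;
    whether the paper counts layers from 0 or 1 is not stated (assumption).
    logits holds one scalar per fused layer (hypothetical learnable weights).
    """
    selected = hidden_states[first:last + 1]
    if len(logits) != len(selected):
        raise ValueError("need one fusion logit per selected layer")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(selected[0])
    return [sum(w * layer[d] for w, layer in zip(weights, selected))
            for d in range(dim)]


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count); rare classes weigh more."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted negative log-likelihood, normalised by the summed weights.

    probs[i] maps each class label to its predicted probability for sample i.
    """
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


def equal_error_rate(scores, labels):
    """EER for scores where higher means more likely fake (label 1; real is 0)."""
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), None
    # Sweep every observed score as a threshold, plus one above all scores.
    for t in sorted(set(scores)) + [float("inf")]:
        false_alarm = sum(s >= t for s in reals) / len(reals)  # real flagged as fake
        miss = sum(s < t for s in fakes) / len(fakes)          # fake accepted as real
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, eer = gap, (false_alarm + miss) / 2
    return eer
```

For instance, with scores `[0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]` and labels `[1, 1, 0, 1, 0, 1, 0, 0]`, `equal_error_rate` returns 0.25: at a threshold of 0.6, one of four real clips is flagged and one of four fake clips is missed. In a real system the fusion logits and the FFN would be trained jointly on top of the frozen encoder; that training loop is omitted here.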
Similar Papers
Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026
Sound
Detects fake sounds to keep audio real.
ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan
Sound
Detects fake sounds in videos and games.
BEAT2AASIST model with layer fusion for ESDD 2026 Challenge
Sound
Detects fake sounds to stop audio tricks.