ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan
By: Xueping Zhang, Han Yin, Yang Xiao, et al.
Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, because the remaining unaltered component can mislead systems designed for fully spoofed audio, and the results often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing; it contains over 250k audio samples with a total duration of approximately 283 hours. Building on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).
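The component-level scenario described above amounts to composing a foreground speech signal with a background environmental sound, where either component may independently be bona fide or spoofed. The sketch below illustrates the basic composition step only: mixing two waveforms at a chosen signal-to-noise ratio. The helper name `mix_at_snr` and all signal values are illustrative assumptions, not part of the CompSpoofV2 toolkit or the challenge baseline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `background` so the speech-to-background power ratio equals
    `snr_db` dB, then sum the two components into one composite waveform.
    (Hypothetical helper for illustration only.)"""
    p_speech = np.mean(speech ** 2)
    p_background = np.mean(background ** 2)
    # Gain that brings background power down (or up) to p_speech / 10^(snr_db/10)
    gain = np.sqrt(p_speech / (p_background * 10.0 ** (snr_db / 10.0)))
    return speech + gain * background

# Toy example: one second of stand-in signals at 16 kHz.
rng = np.random.default_rng(0)
sr = 16000
speech = 0.1 * rng.standard_normal(sr)       # stands in for a speech component
background = 0.3 * rng.standard_normal(sr)   # stands in for an environmental sound
mixture = mix_at_snr(speech, background, snr_db=5.0)
```

In a component-level spoofing setup, either `speech` or `background` (or both) could be replaced by a generated signal before mixing, producing the partially manipulated audio the challenge targets.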