ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan

Authors: Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li

Published: 2026-01-12 08:27:06+00:00

AI Summary

This paper introduces the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on detecting component-level manipulations in which either speech or environmental sounds (or both) can be synthesized or altered. To support the challenge, the authors propose the large-scale CompSpoofV2 dataset and a separation-enhanced joint learning framework. The challenge aims to promote research in this more realistic and more complex audio deepfake detection scenario.

Abstract

Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, as the remaining unaltered component can mislead systems designed for whole-utterance deepfake audio, and they often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, containing over 250k audio samples with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).


Key findings
The baseline separation-enhanced joint learning framework achieved Macro-F1 scores of approximately 0.62-0.63 on the evaluation and test sets of CompSpoofV2. While the model performed well in detecting original audio and spoofed speech (low EERs), detecting spoofed environmental sound components proved more challenging, indicated by significantly higher EER values (around 0.37-0.43).
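The two figures reported above, Macro-F1 over the five component-level classes and per-component EER, can be computed as in the following sketch. The label and score arrays are illustrative placeholders, not values released by the challenge.

```python
# Sketch of the two evaluation metrics cited above; inputs are placeholders.
import numpy as np
from sklearn.metrics import f1_score, roc_curve

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over the five spoofing categories.
    return f1_score(y_true, y_pred, average="macro")

def equal_error_rate(labels, scores):
    # EER: operating point where false-acceptance and false-rejection rates meet.
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage: five-class predictions (0..4) and binary spoof scores for one component.
y_true = [0, 1, 2, 3, 4, 0, 2]
y_pred = [0, 1, 2, 4, 4, 0, 1]
print("Macro-F1:", macro_f1(y_true, y_pred))

labels = np.array([1, 1, 0, 0, 1, 0])            # 1 = spoofed component
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.5])
print("EER:", equal_error_rate(labels, scores))
```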
Approach
The proposed separation-enhanced joint learning framework first detects potentially spoofed mixtures, then separates the audio into speech and environmental components. Each component is processed by a component-specific anti-spoofing model, and their outputs are fused to predict one of five component-level spoofing categories.
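A minimal sketch of how such a pipeline could be wired together is shown below. The submodule names (mixture_detector, separator, speech_cm, env_cm) and the embedding size are hypothetical stand-ins, not the authors' implementation; any off-the-shelf separator and countermeasure backbones could fill these slots.

```python
# Hypothetical wiring of the described pipeline; submodules and shapes are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 5  # the five component-level spoofing categories

class SeparationEnhancedDetector(nn.Module):
    def __init__(self, mixture_detector, separator, speech_cm, env_cm, embed_dim=256):
        super().__init__()
        self.mixture_detector = mixture_detector  # flags potentially spoofed mixtures
        self.separator = separator                # splits the mixture into speech / environment
        self.speech_cm = speech_cm                # anti-spoofing model for the speech stream
        self.env_cm = env_cm                      # anti-spoofing model for the environment stream
        self.fusion = nn.Linear(2 * embed_dim, NUM_CLASSES)

    def forward(self, mixture):
        # 1) mixture-level spoof evidence (useful for joint training or gating)
        mix_logit = self.mixture_detector(mixture)
        # 2) separate the mixture into its two components
        speech, env = self.separator(mixture)
        # 3) component-specific anti-spoofing embeddings
        speech_emb = self.speech_cm(speech)
        env_emb = self.env_cm(env)
        # 4) fuse both branches into one of the five component-level categories
        logits = self.fusion(torch.cat([speech_emb, env_emb], dim=-1))
        return logits, mix_logit
```

In a joint-learning setup of this kind, the training objective would typically combine the five-class fusion loss with auxiliary losses on the mixture-level detector and the separated streams.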
Datasets
CompSpoofV2 (over 250k samples, approximately 283 hours). The dataset is curated from multiple sources, including AudioCaps, VGGSound, CommonVoice, LibriTTS, english-conversation-corpus, TAUUAS, TUTSED, UrbanSound, EnvSDD, VcapAV, ASV5, and MLAAD.
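The exact curation protocol of CompSpoofV2 is not detailed in this summary; the sketch below only illustrates, under that caveat, how a component-level sample is typically composed by mixing a speech track with an environmental track at a target signal-to-noise ratio.

```python
# Generic component mixing at a target SNR; this is an assumed illustration,
# not the documented CompSpoofV2 curation procedure.
import numpy as np

def mix_at_snr(speech: np.ndarray, env: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop/trim the environmental track to match the speech length.
    if len(env) < len(speech):
        env = np.tile(env, int(np.ceil(len(speech) / len(env))))
    env = env[: len(speech)]
    # Scale the environment so the speech-to-environment power ratio equals snr_db.
    p_speech = np.mean(speech ** 2) + 1e-12
    p_env = np.mean(env ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_env * 10 ** (snr_db / 10)))
    return speech + gain * env

# Example: mixing bona fide speech with a synthesized environmental sound at 5 dB SNR
# would yield a sample whose environmental component is spoofed.
```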
Model(s)
UNKNOWN
Author countries
China, South Korea, Australia, USA, Singapore