Environmental Sound Deepfake Detection Challenge: An Overview

Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Published: 2025-12-30 11:03:36+00:00

AI Summary

This paper provides an overview of the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge, which is built on EnvSDD, the first large-scale curated dataset for ESDD. EnvSDD addresses the limited scale and sound-category diversity of earlier ESDD datasets, and the challenge aims to advance methods for detecting fake environmental sounds. The paper analyzes the challenge results and highlights effective design choices shared by top-performing systems across the two tracks.

Abstract

Recent progress in audio generation models has made it possible to create highly realistic and immersive soundscapes, which are now widely used in film and virtual-reality-related applications. However, these audio generators also raise concerns about potential misuse, such as producing deceptive audio for fabricated videos or spreading misleading information. Therefore, it is essential to develop effective methods for detecting fake environmental sounds. Existing datasets for environmental sound deepfake detection (ESDD) remain limited in both scale and the diversity of sound categories they cover. To address this gap, we introduced EnvSDD, the first large-scale curated dataset designed for ESDD. Based on EnvSDD, we launched the ESDD Challenge, recognized as one of the ICASSP 2026 Grand Challenges. This paper presents an overview of the ESDD Challenge, including a detailed analysis of the challenge results.


Key findings
The challenge demonstrated the effectiveness of current deepfake detection methods, with top teams achieving low Equal Error Rates (EERs) of 0.30% for Track 1 and 0.25% for Track 2. Successful strategies predominantly involved combining self-supervised learning (SSL) based front-ends with AASIST-style back-ends, extensive data augmentation, and ensemble approaches. Surprisingly, the black-box low-resource setting (Track 2) did not present a significantly greater challenge than generalizing to unseen generators (Track 1).
Approach
The paper introduces the EnvSDD dataset and describes the ESDD Challenge, whose two tracks focus on generalization to unseen generators and on a black-box low-resource setting, respectively. It then summarizes the effective approaches adopted by participating teams, which commonly combine strong pre-trained self-supervised learning (SSL) front-ends with robust back-end classifiers and use ensemble methods.
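The summary does not say how the teams built their ensembles. One common form is score-level fusion, sketched below under that assumption: each system's scores are standardized so that systems with different score ranges contribute comparably, then combined by a weighted average. The function name `fuse_scores` and its `weights` parameter are hypothetical and do not come from any participating system.

```python
import numpy as np


def fuse_scores(system_scores, weights=None):
    """Fuse per-clip scores from several detectors by weighted averaging.

    system_scores: array-like of shape (n_systems, n_clips).
    weights: optional per-system weights; equal weights if omitted.
    This is an illustrative sketch, not a method reported in the paper.
    """
    scores = np.asarray(system_scores, dtype=float)

    # Standardize each system's scores (zero mean, unit variance) so that
    # no system dominates merely because its scores span a wider range.
    mean = scores.mean(axis=1, keepdims=True)
    std = scores.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # a constant-score system carries no information
    normalized = (scores - mean) / std

    if weights is None:
        weights = np.ones(scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()

    # Weighted average across systems gives one fused score per clip.
    return weights @ normalized


if __name__ == "__main__":
    # Two hypothetical systems scoring the same three clips on different scales.
    fused = fuse_scores([[0.0, 1.0, 2.0], [10.0, 20.0, 30.0]])
    print(fused)
```

In practice the normalization statistics and weights would be estimated on a held-out development set rather than on the evaluation scores themselves.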
Datasets
EnvSDD, AudioCaps
Model(s)
AASIST, BEATs, EAT, SSLAM, BiCrossMamba-ST, FFN, ArcFace, LoRA
Author countries
Republic of Korea, Australia, Singapore, China