The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Published: 2026-03-05 06:40:57+00:00

AI Summary

This paper introduces and analyzes the first Environmental Sound Deepfake Detection (ESDD) challenge, aiming to benchmark robustness and advance research in this underexplored field. It details the challenge formulation, dataset construction, evaluation protocols, and baseline systems. The paper also analyzes common architectural choices and training strategies of top-performing systems, providing key insights and future research directions for ESDD.

Abstract

Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.


Key findings

The challenge demonstrated that high-fidelity generative models can severely degrade conventional deepfake detection baselines, especially against unseen generators. However, robust generalization is achievable by leveraging large-scale self-supervised representations, carefully designed data augmentation strategies, and ensemble modeling. Top-performing systems achieved substantially lower equal error rates (EERs) than the official baselines, highlighting the effectiveness of generator-agnostic spoofing cues as well as the remaining difficulty of detecting deepfakes from advanced TTA and black-box VTA models.
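The ensemble modeling mentioned above is commonly realized as score-level fusion of several detectors. A minimal sketch of weighted score fusion with per-system normalization (the function name, weighting scheme, and z-normalization step are illustrative assumptions, not details from the paper):

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Weighted average of per-system detection scores.

    Each system's scores are z-normalized first, so systems with
    different score scales contribute comparably to the fusion.
    """
    fused = np.zeros(len(score_lists[0]))
    weights = weights or [1.0] * len(score_lists)
    for scores, w in zip(score_lists, weights):
        s = np.asarray(scores, dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-8)  # z-norm per system
        fused += w * s
    return fused / sum(weights)
```

Fusion of this kind preserves each system's ranking of test clips while letting complementary detectors correct one another's errors.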
Approach

The paper outlines the design and execution of the first ESDD challenge, which comprised two tracks: one testing robustness against unseen TTA/ATA generators, and another targeting low-resource, black-box VTA generators. The challenge established the EnvSDD dataset, defined the evaluation metric (equal error rate, EER), and provided baseline detection systems. The work then analyzes the performance and strategies of the top-ranking participant systems.
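The EER used for evaluation is the operating point where the false-acceptance and false-rejection rates are equal. A minimal reference computation (the function name and the sorting-based approach are an illustrative sketch, not the challenge's official scoring script):

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Equal Error Rate: the point where the false-rejection rate
    (bona fide scored below threshold) equals the false-acceptance
    rate (spoofed scored at or above threshold)."""
    scores = np.concatenate([bona_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bona_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep thresholds between consecutive sorted scores:
    # FRR(i) = bona fide among the i+1 lowest scores / total bona fide
    # FAR(i) = spoofed above the i-th score / total spoofed
    frr = np.cumsum(labels) / len(bona_scores)
    far = 1.0 - np.cumsum(1 - labels) / len(spoof_scores)
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)
```

A perfectly separable system yields an EER of 0, while a chance-level system sits near 0.5; challenge rankings are based on minimizing this quantity on the evaluation set.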
Datasets

EnvSDD, UrbanSound8K, DCASE 2023 Task 7 Dev, TAU 2019 Open Dev, TUT SED 2016, TUT SED 2017, Clotho, AudioCaps, VGGSound

Model(s)

AASIST, BEATs (as feature extractor), EAT (Efficient Audio Transformer), SSLAM (Self-Supervised Learning for Audio Mixtures), BiCrossMamba, BEAT2AASIST, Multi-Head Factorized Attention (MHFA), LoRA, ArcFace loss, Domain Adversarial Training

Author countries

Republic of Korea, Australia, Singapore, China