ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan

Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Published: 2025-08-06 15:09:44+00:00

AI Summary

This paper proposes EnvSDD, a large-scale dataset for environmental sound deepfake detection, and launches the Environmental Sound Deepfake Detection Challenge (ESDD 2026) based on it. The challenge features two tracks: one focusing on unseen generators and another on black-box low-resource detection.

Abstract

Recent advances in audio generation systems have enabled the creation of highly realistic and immersive soundscapes, which are increasingly used in film and virtual reality. However, these audio generators also raise concerns about potential misuse, such as generating deceptive audio content for fake videos and spreading misleading information. Existing datasets for environmental sound deepfake detection (ESDD) are limited in scale and audio types. To address this gap, we have proposed EnvSDD, the first large-scale curated dataset designed for ESDD, consisting of 45.25 hours of real and 316.7 hours of fake sound. Based on EnvSDD, we are launching the Environmental Sound Deepfake Detection Challenge. Specifically, we present two different tracks: ESDD in Unseen Generators and Black-Box Low-Resource ESDD, covering various challenges encountered in real-life scenarios. The challenge will be held in conjunction with the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026).


Key findings
Baseline results show significantly higher equal error rates (EERs) under the unseen-generator and black-box low-resource conditions than on the validation sets, highlighting the difficulty of generalizing to unknown deepfake generation methods. Incorporating the pre-trained BEATs model improves performance, but the challenge remains substantial.
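The EER cited above is the operating point at which the false acceptance rate equals the false rejection (miss) rate. Below is a minimal sketch of how it can be computed from system scores, assuming a convention in which higher scores indicate real (bonafide) audio and labels use 1 for real and 0 for fake; the `compute_eer` helper is illustrative only and is not the official challenge scoring tool.

```python
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels, scores):
    """Equal error rate: the point where the false positive rate
    (fake accepted as real) equals the false negative rate
    (real rejected as fake)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    # Pick the threshold index where FPR and FNR are closest.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0


# Toy example: four real clips (label 1) and four fakes (label 0).
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])
print(f"EER = {compute_eer(labels, scores):.2%}")
```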
Approach
The challenge uses the EnvSDD dataset, which contains real environmental sounds and fake sounds produced by a range of generation methods. Two tracks are proposed: one evaluates generalization to unseen text-to-audio (TTA) and audio-to-audio (ATA) generators, and the other tests low-resource detection when the generation methods are entirely unknown (black-box).
Datasets
EnvSDD, which draws real sounds from UrbanSound8K, TAU UAS 2019 Open Dev, TUT SED 2016, TUT SED 2017, DCASE 2023 Task 7 Dev, and Clotho, with fake sounds generated by various TTA and ATA models.
Model(s)
AASIST (using a heterogeneous stacking graph attention mechanism) and BEATs+AASIST (incorporating the pre-trained BEATs audio foundation model).
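To illustrate the general pattern behind the BEATs+AASIST baseline, the sketch below pairs a frozen pre-trained feature extractor with a lightweight detection back end. Both sub-modules are simple placeholders standing in for BEATs and AASIST respectively; they are not the published architectures, and the class names, feature dimension, and pooling strategy are assumptions for illustration.

```python
import torch
import torch.nn as nn


class DummyFrontend(nn.Module):
    """Placeholder audio encoder standing in for a pre-trained
    foundation model such as BEATs (not the real BEATs API)."""

    def __init__(self, feat_dim: int = 768, hop: int = 320):
        super().__init__()
        # Strided 1-D convolution turns a waveform into frame-level features.
        self.proj = nn.Conv1d(1, feat_dim, kernel_size=hop, stride=hop)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch, frames, feat_dim)
        x = self.proj(wav.unsqueeze(1))
        return x.transpose(1, 2)


class FrontEndBackEndDetector(nn.Module):
    """Frozen front-end features feeding a small classifier back end.
    The back end here is a pooling + MLP placeholder, not AASIST's
    heterogeneous stacking graph attention network."""

    def __init__(self, frontend: nn.Module, feat_dim: int = 768, n_classes: int = 2):
        super().__init__()
        self.frontend = frontend
        for p in self.frontend.parameters():  # keep the foundation model frozen
            p.requires_grad = False
        self.backend = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(waveform)       # (batch, frames, feat_dim)
        pooled = feats.mean(dim=1)            # average over time frames
        return self.backend(pooled)           # (batch, n_classes) logits


model = FrontEndBackEndDetector(DummyFrontend())
logits = model(torch.randn(4, 64000))         # four 4-second clips at 16 kHz
print(logits.shape)                           # torch.Size([4, 2])
```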
Author countries
Republic of Korea, Australia, Singapore, China