EnvSDD: Benchmarking Environmental Sound Deepfake Detection

Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D. Plumbley

Published: 2025-05-25 16:02:56+00:00

AI Summary

This paper introduces EnvSDD, the first large-scale dataset for environmental sound deepfake detection, comprising 45.25 hours of real and 316.74 hours of fake audio. It also proposes a deepfake detection system built on the pre-trained audio foundation model BEATs; this system outperforms existing state-of-the-art methods from the speech and singing domains.

Abstract

Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set covers diverse conditions, including unseen generation models and unseen datasets, to evaluate generalizability. We also propose an audio deepfake detection system based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms state-of-the-art systems from the speech and singing domains.


Key findings
The proposed BEATs+AASIST system significantly outperforms baseline systems (AASIST and W2V2+AASIST) across various test conditions, including in-domain and out-of-domain evaluations. The results highlight the effectiveness of using a pre-trained audio foundation model for environmental sound deepfake detection. However, generalization to unseen domains remains a challenge.
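This summary does not state the evaluation metric, but audio deepfake detection work conventionally reports equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of computing EER from per-clip scores, assuming label 1 marks real audio and higher scores mean "more likely real":

```python
# A minimal sketch (not from the paper): equal error rate from per-clip
# detection scores, where label 1 = real audio and higher score = more real.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr  # false-negative rate at each threshold
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where the two rates cross
    return float((fpr[idx] + fnr[idx]) / 2.0)

# toy example: four real clips, four fakes
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.4, 0.2, 0.1, 0.05])
print(f"EER: {compute_eer(labels, scores):.3f}")
```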
Approach
The authors created the EnvSDD dataset using real-world audio and fake audio generated by text-to-audio and audio-to-audio models. They then developed a deepfake detection system by integrating the pre-trained audio foundation model BEATs with the AASIST architecture.
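To illustrate the first step, text-to-audio generation of fake environmental sounds can be sketched with the Hugging Face diffusers AudioLDM pipeline, one of the generators listed under Model(s) below. The checkpoint and caption here are illustrative assumptions, not details taken from the paper:

```python
# A minimal sketch, assuming the Hugging Face diffusers AudioLDM pipeline.
# The checkpoint name and the caption below are illustrative; the exact
# models, prompts, and settings used to build EnvSDD are not stated here.
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")

audio = pipe(
    "rain falling on a tin roof",  # hypothetical environmental-sound caption
    num_inference_steps=50,
    audio_length_in_s=5.0,
).audios[0]

# AudioLDM generates 16 kHz mono audio
scipy.io.wavfile.write("fake_env_sound.wav", rate=16000, data=audio)
```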
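For the second step, a minimal sketch of the frontend-plus-backend pattern the detector follows, with a hypothetical dummy frontend standing in for BEATs and a plain classification head standing in for AASIST's graph-attention backend; neither real model is reproduced here:

```python
# A minimal PyTorch sketch of the frontend+backend pattern. DummyFrontend
# is a hypothetical stand-in for BEATs, and the linear head stands in
# for the AASIST graph-attention backend.
import torch
import torch.nn as nn

class DummyFrontend(nn.Module):
    """Maps raw 16 kHz waveforms [B, 16000] to frame embeddings [B, T, D]."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(160, embed_dim)  # 10 ms frames at 16 kHz

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        frames = wav.unfold(1, 160, 160)  # [B, T=100, 160] non-overlapping frames
        return self.proj(frames)          # [B, T, D]

class FrontendBackendDetector(nn.Module):
    def __init__(self, frontend: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.frontend = frontend
        self.head = nn.Sequential(        # stand-in for the AASIST backend
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),            # two classes: real vs. fake
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(wav)        # frame-level embeddings [B, T, D]
        pooled = feats.mean(dim=1)        # temporal average pooling -> [B, D]
        return self.head(pooled)          # logits [B, 2]

model = FrontendBackendDetector(DummyFrontend())
logits = model(torch.randn(4, 16000))     # one second of 16 kHz audio per clip
print(logits.shape)                       # torch.Size([4, 2])
```

In the paper's comparison, the same backend paired with the general-audio BEATs frontend outperforms the speech-pretrained wav2vec 2.0 XLS-R frontend (W2V2+AASIST), which is the effect the Key findings attribute to the audio foundation model.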
Datasets
EnvSDD (created by the authors), UrbanSound8K, DCASE 2023 Task 7 Dev, TAU Urban Acoustic Scenes 2019 Open Dev, TUT SED 2016, TUT SED 2017, Clotho, AudioSet-2M
Model(s)
AASIST, W2V2+AASIST, BEATs+AASIST, AudioLDM, AudioLDM 2, AudioGen, TangoFlux, AudioLCM, wav2vec 2.0 XLS-R, BEATs, Mistral 7B (LLM)
Author countries
China, Australia, Singapore, UK