BUT Systems for Environmental Sound Deepfake Detection in the ESDD 2026 Challenge

Authors: Junyi Peng, Lin Zhang, Jin Li, Oldřich Plchot, Jan Černocký

Published: 2025-12-09 07:32:55+00:00

AI Summary

This paper presents the BUT submission to the ESDD 2026 Challenge, focusing on environmental sound deepfake detection with unseen generators. The main contribution is a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models coupled with a Multi-Head Factorized Attention (MHFA) back-end. A feature domain augmentation strategy based on distribution uncertainty modeling is also introduced to enhance robustness against unseen spectral distortions.
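The front-ends in this framework expose layer-wise hidden states that the back-end pools. Below is a hedged sketch of extracting those states from one SSL front-end (WavLM via HuggingFace transformers); the checkpoint name is illustrative, and the paper uses several front-ends, not just this one.

```python
# Minimal sketch: stack layer-wise SSL hidden states for a layer-weighting
# back-end such as MHFA. Checkpoint name is an illustrative assumption.
import torch
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

wav = torch.randn(1, 16000)  # 1 s of 16 kHz audio (placeholder input)
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states is a tuple of (batch, time, dim) tensors, one per layer
# (plus the CNN front-end output); stack to (batch, layers, time, dim).
layer_feats = torch.stack(out.hidden_states, dim=1)
```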

Abstract

This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00%, 4.60%, and 4.80% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00%, 3.52%, and 4.38% across the same partitions.
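Since the MHFA back-end is central to every system described here, the following is a minimal PyTorch sketch of an MHFA-style pooling layer over stacked SSL hidden states. The layer count, feature dimension, head sizes, and two-class output are illustrative assumptions, not the authors' reported configuration.

```python
# MHFA-style back-end sketch: separate learnable layer weights for key and
# value streams, multi-head attentive pooling over time, then a classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHFA(nn.Module):
    def __init__(self, num_layers=13, feat_dim=768, num_heads=8,
                 head_dim=32, num_classes=2):
        super().__init__()
        self.key_layer_w = nn.Parameter(torch.zeros(num_layers))
        self.val_layer_w = nn.Parameter(torch.zeros(num_layers))
        self.key_proj = nn.Linear(feat_dim, num_heads)         # one score per head
        self.val_proj = nn.Linear(feat_dim, num_heads * head_dim)
        self.classifier = nn.Linear(num_heads * head_dim, num_classes)
        self.num_heads, self.head_dim = num_heads, head_dim

    def forward(self, layer_feats):
        # layer_feats: (batch, num_layers, time, feat_dim) from the SSL encoder.
        k_w = F.softmax(self.key_layer_w, dim=0).view(1, -1, 1, 1)
        v_w = F.softmax(self.val_layer_w, dim=0).view(1, -1, 1, 1)
        keys = (layer_feats * k_w).sum(dim=1)                  # (batch, time, dim)
        vals = (layer_feats * v_w).sum(dim=1)
        attn = F.softmax(self.key_proj(keys), dim=1)           # softmax over time
        v = self.val_proj(vals).view(vals.size(0), vals.size(1),
                                     self.num_heads, self.head_dim)
        # Attention-weighted pooling over time, independently per head.
        pooled = torch.einsum("bth,bthd->bhd", attn, v).flatten(1)
        return self.classifier(pooled)                         # real/fake logits
```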


Key findings
The proposed systems significantly outperformed the official baselines, with the best single system achieving an EER of 4.80% and the fusion system 4.38% on the Final Evaluation set. General audio SSL models consistently outperformed speech-specific SSLs for environmental sound deepfake detection. Feature domain augmentation with DSU, together with fine-tuning on AudioSet-2M, further enhanced model robustness and generalization to unseen generators.
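All results above are reported as EER, and the fusion system combines scores from the single systems. The sketch below shows the standard EER computation from scores and labels, plus a simple equal-weight score fusion; the paper's actual fusion scheme and weights are not given here, so equal weighting is an assumption.

```python
# EER: the operating point where the false-alarm rate equals the miss rate.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = target class
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # point where FNR ~ FPR
    return (fpr[idx] + fnr[idx]) / 2.0

# Equal-weight score-level fusion across systems (assumed weighting; the
# weights could also be tuned on the development set).
def fuse(system_scores):
    return np.mean(np.stack(system_scores, axis=0), axis=0)
```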
Approach
The approach combines diverse Self-Supervised Learning (SSL) models (BEATs, EAT, Dasheng, WavLM) as front-ends to extract hierarchical features, which are then fed into a lightweight Multi-Head Factorized Attention (MHFA) back-end for classification. To improve generalization to unseen generators, a feature domain augmentation strategy based on distribution uncertainty modeling (DSU) is applied to the Value stream of the MHFA back-end, as sketched below.
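DSU perturbs feature statistics rather than raw audio. Below is a minimal sketch of a DSU-style perturbation on a (batch, time, dim) tensor such as the MHFA Value stream; the application probability p and the exact placement inside the back-end follow common DSU usage and are assumptions, not the authors' stated configuration.

```python
# DSU-style feature-statistics perturbation: re-normalize each feature with
# a mean/std resampled from Gaussians whose scale reflects how much those
# statistics vary across the batch (a proxy for domain shift uncertainty).
import torch

def dsu_augment(x, p=0.5, training=True, eps=1e-6):
    # x: (batch, time, dim); perturb only during training, with probability p.
    if not training or torch.rand(1).item() > p:
        return x
    mu = x.mean(dim=1, keepdim=True)                    # (batch, 1, dim)
    sig = (x.var(dim=1, keepdim=True) + eps).sqrt()
    # Uncertainty of the statistics, estimated across the batch.
    sig_mu = (mu.var(dim=0, keepdim=True) + eps).sqrt()
    sig_sig = (sig.var(dim=0, keepdim=True) + eps).sqrt()
    # Resample mean/std with Gaussian noise scaled by that uncertainty.
    beta = mu + torch.randn_like(mu) * sig_mu
    gamma = sig + torch.randn_like(sig) * sig_sig
    return gamma * (x - mu) / sig + beta
```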
Datasets
EnvSDD, AudioSet-2M (AS2M)
Model(s)
BEATs, EAT, Dasheng, WavLM, Multi-Head Factorized Attention (MHFA)
Author countries
Czechia, USA, Hong Kong SAR