Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026

Authors: Candy Olivia Mawalim, Haotian Zhang, Shogo Okada

Published: 2025-12-05 03:37:18+00:00

AI Summary

This paper presents the Nomi Team's work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The team proposes an audio-text cross-attention model to address unseen generators and low-resource black-box scenarios. Experiments demonstrate competitive EER improvements over the challenge baseline, particularly when integrating semantic text and using an ensemble model.

Abstract

This paper presents our work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The challenge is based on the large-scale EnvSDD dataset that consists of various synthetic environmental sounds. We focus on addressing the complexities of unseen generators and low-resource black-box scenarios by proposing an audio-text cross-attention model. Experiments with individual and combined text-audio models demonstrate competitive EER improvements over the challenge baseline (BEATs+AASIST model).


Key findings
The proposed ATCA model achieves lower EER than the BEATs+AASIST baseline in Track 1 (unseen generators, 11.28% vs 13.20%). The ensemble model (ATCA-ens) further reduces EER in both tracks, achieving 11.22% in Track 1 and 11.98% in Track 2 (low-resource black-box), outperforming the baseline (13.20% and 12.48% respectively). The text modality is particularly beneficial against unseen generators, though its impact is somewhat diminished in low-resource scenarios.
Approach
The team proposes an audio-text cross-attention (ATCA) model in which acoustic features query text embeddings (derived from audio captions) to enhance the audio representation; the audio backbone builds on BEATs and AASIST. An ensemble model (ATCA-ens) further stacks multiple ATCA variants and the BEATs+AASIST baseline, feeding their outputs together with RoBERTa text features to classical regressors (gradient boosting, random forest, and linear regressors) that act as a meta-learner, as sketched below.
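The following is a minimal sketch of the audio-text cross-attention idea: acoustic frames act as queries over caption-text token embeddings, and the attended context is fused back into the audio representation before classification. Dimensions, layer sizes, and the residual fusion strategy are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ATCABlock(nn.Module):
    """Acoustic frames query caption-text token embeddings (cross-attention)."""

    def __init__(self, audio_dim=768, text_dim=768, n_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, audio_dim)
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feats, text_embs):
        # audio_feats: (B, T_a, audio_dim), e.g. BEATs frame embeddings
        # text_embs:   (B, T_t, text_dim),  e.g. RoBERTa embeddings of an audio caption
        text = self.text_proj(text_embs)
        attended, _ = self.cross_attn(query=audio_feats, key=text, value=text)
        # Residual fusion: text-informed context enhances the audio representation.
        return self.norm(audio_feats + attended)


class ATCAClassifier(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768):
        super().__init__()
        self.fusion = ATCABlock(audio_dim, text_dim)
        self.head = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, audio_feats, text_embs):
        fused = self.fusion(audio_feats, text_embs)
        pooled = fused.mean(dim=1)             # temporal average pooling
        return self.head(pooled).squeeze(-1)   # real/fake logit per clip


# Example with random tensors standing in for BEATs / RoBERTa outputs.
model = ATCAClassifier()
audio = torch.randn(2, 200, 768)    # 2 clips, 200 audio frames
caption = torch.randn(2, 20, 768)   # 2 captions, 20 text tokens
logits = model(audio, caption)      # shape (2,)
```

The ensemble can be read as a stacking setup: per-clip scores from the sub-models, concatenated with text-derived features, are regressed onto the real/fake label by classical models whose outputs are combined. The feature layout, regressor settings, and averaging below are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_clips = 500
# Columns: scores from several sub-models plus a compact text feature, per clip.
X = rng.normal(size=(n_clips, 5))
y = rng.integers(0, 2, size=n_clips).astype(float)  # 0 = real, 1 = fake

meta_learners = [
    GradientBoostingRegressor(n_estimators=200),
    RandomForestRegressor(n_estimators=200),
    Ridge(alpha=1.0),
]
for reg in meta_learners:
    reg.fit(X, y)

# Average the regressors' outputs into a single ensemble score per clip.
ensemble_score = np.mean([reg.predict(X) for reg in meta_learners], axis=0)
```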
Datasets
EnvSDD dataset (which compiles real sounds from UrbanSound8K, TAU UAS 2019 Open Dev, TUT SED 2016, TUT SED 2017, DCASE 2023 Task 7 Dev, and Clotho).
Model(s)
AASIST, BEATs, RoBERTa (base), audio-text cross-attention (ATCA) model, transformer encoder-decoder for audio captioning, stacked Gated Recurrent Unit (GRU) network, gradient boosting, random forest, linear regressors.
Author countries
Japan