EnvSSLAM-FFN: Lightweight Layer-Fused System for ESDD 2026 Challenge

Authors: Xiaoxuan Guo, Hengyan Huang, Jiayi Zhou, Renhe Sun, Jian Liu, Haonan Cheng, Long Ye, Qin Zhang

Published: 2025-12-23 13:54:02+00:00

Comment: ESDD 2026 Challenge Technical Report

AI Summary

This paper proposes EnvSSLAM-FFN, a lightweight system for environmental sound deepfake detection in the ESDD 2026 Challenge. It combines a frozen SSLAM self-supervised encoder with a feed-forward network (FFN) back-end, fusing intermediate SSLAM representations (layers 4-9) to capture spoofing artifacts and applying a class-weighted training objective to counter data imbalance.

Abstract

Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection under unseen-generator (Track 1) and black-box low-resource (Track 2) conditions. We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight FFN back-end. To capture spoofing artifacts effectively under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system consistently outperforms the official baselines on both tracks, achieving Test Equal Error Rates (EERs) of 1.20% and 1.05%, respectively.


Key findings
The EnvSSLAM-FFN system significantly outperforms the official baselines on both tracks of the ESDD 2026 Challenge, achieving Test Equal Error Rates (EERs) of 1.20% on Track 1 (unseen generators) and 1.05% on Track 2 (black-box low-resource detection). These results demonstrate superior detection capability and adaptability compared to the baselines, whose EERs exceeded 12%.
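For reference, the EER reported here is the operating point at which the false-acceptance and false-rejection rates are equal. Below is a minimal sketch of how it is typically computed from detection scores; the challenge's official scoring script may differ in detail.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: threshold where false-accept rate equals false-reject rate.
    labels: 1 = bona fide, 0 = fake; scores: higher = more likely bona fide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))    # point closest to the FAR/FRR crossing
    return float((fpr[i] + fnr[i]) / 2)    # average the two rates at that point
```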
Approach
The EnvSSLAM-FFN system extracts frame-level embeddings with a frozen SSLAM self-supervised encoder, fuses the intermediate representations from layers 4-9, aggregates them over time with attentive statistics pooling, and classifies the pooled embedding with a lightweight FFN back-end. A class-weighted binary cross-entropy loss mitigates the severe label imbalance during training; a sketch of the full pipeline follows.
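The PyTorch sketch below illustrates this pipeline under stated assumptions: the report does not specify the fusion operator, so a learned softmax-weighted sum over layers 4-9 stands in as one common choice; the embedding dimension (768), FFN width, dropout rate, and class-weight values are illustrative placeholders rather than the authors' settings; and the frozen encoder is assumed to expose per-layer hidden states, as transformer-based SSL encoders typically do.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Attentive statistics pooling: attention-weighted mean and std over time."""

    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.Tanh(), nn.Linear(bottleneck, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level embeddings
        w = torch.softmax(self.attn(x), dim=1)                 # attention over time
        mu = (w * x).sum(dim=1)                                # weighted mean
        std = ((w * (x - mu.unsqueeze(1)) ** 2).sum(dim=1) + 1e-6).sqrt()
        return torch.cat([mu, std], dim=-1)                    # (batch, 2*dim)


class EnvSSLAMFFNSketch(nn.Module):
    """Layer fusion of frozen-encoder outputs + attentive pooling + FFN.

    A sketch, not the authors' implementation: fusion is modeled as a
    learned softmax-weighted sum over layers 4-9 (one common choice)."""

    def __init__(self, dim: int = 768, layers=range(4, 10), hidden: int = 256):
        super().__init__()
        self.layers = list(layers)                             # layer indices 4..9
        self.layer_weights = nn.Parameter(torch.zeros(len(self.layers)))
        self.pool = AttentiveStatsPooling(dim)
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),                              # one logit: real vs. fake
        )

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: per-layer outputs of the frozen SSLAM encoder,
        # each of shape (batch, time, dim)
        stack = torch.stack([hidden_states[i] for i in self.layers])  # (L, B, T, D)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * stack).sum(dim=0)                                # (B, T, D)
        return self.ffn(self.pool(fused)).squeeze(-1)                 # (B,) logits


# Class-weighted BCE: up-weight the minority class by inverse frequency
# (one simple scheme; n_real / n_fake are hypothetical dataset counts,
# with label 1 = real / bona fide).
n_real, n_fake = 5_000, 50_000
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n_fake / n_real]))

# Smoke test with dummy tensors standing in for SSLAM's per-layer outputs:
hidden = [torch.randn(2, 100, 768) for _ in range(13)]
model = EnvSSLAMFFNSketch()
loss = criterion(model(hidden), torch.tensor([1.0, 0.0]))
```

The softmax over layer_weights keeps the fusion a convex combination of layers; concatenation followed by a linear projection would be an equally plausible reading of "fusion", and the actual layer indexing convention and class weights depend on details not given in this summary.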
Datasets
EnvSDD dataset
Model(s)
SSLAM (Self-Supervised Learning with Audio Mixtures) encoder, Feed-Forward Network (FFN) back-end
Author countries
China