Detection of Deepfake Environmental Audio

Authors: Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller

Published: 2024-03-26 09:35:16+00:00

AI Summary

This paper proposes a deepfake environmental audio detection pipeline built on CLAP audio embeddings. Evaluated on the 2023 DCASE challenge dataset for Foley sound synthesis, the method detects fake sounds generated by 44 state-of-the-art synthesizers with 98% average accuracy, a 10% improvement over VGGish embeddings.

Abstract

With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based on the CLAP audio embedding. We evaluate this detector using audio data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art synthesizers can be detected on average with 98% accuracy. We show that using an audio embedding learned on environmental audio is beneficial over a standard VGGish one, as it provides a 10% increase in detection performance. Informal listening to Incorrect Negative examples (fake sounds misclassified as real) demonstrates audible features of fake sounds missed by the detector, such as distortion and implausible background noise.


Key findings
The CLAP embedding-based model achieved the highest accuracy (98%), significantly outperforming VGGish and PANN embeddings. Informal listening revealed that the detector missed some audible cues of fakeness, such as distortion and implausible background noise, suggesting room for improvement in both detection and generation models.
Approach
The authors use a simple pipeline that feeds pre-trained audio embeddings (VGGish, CLAP, or PANN) into a multi-layer perceptron (MLP) for binary classification of environmental sounds as real or fake. Embeddings are averaged over time so that the classifier does not rely on temporal sequence effects.
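A minimal sketch of such a feature-extraction step, assuming the open-source laion_clap package and hypothetical file paths (this is an illustration, not the authors' code):

```python
import numpy as np
import laion_clap

# Load a pretrained CLAP audio encoder (downloads a default checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Hypothetical file lists; label 0 = recorded (real), 1 = generated (fake).
real_files = ["real/dog_bark_001.wav", "real/footstep_001.wav"]
fake_files = ["fake/dog_bark_synth_001.wav", "fake/footstep_synth_001.wav"]

# CLAP yields one fixed-size embedding per clip, shape (n_files, 512).
X = model.get_audio_embedding_from_filelist(x=real_files + fake_files,
                                            use_tensor=False)
y = np.array([0] * len(real_files) + [1] * len(fake_files))

# For frame-wise embeddings such as VGGish or PANN, the per-frame vectors
# would instead be averaged over time, e.g. frames.mean(axis=0), so the
# classifier never sees the temporal ordering.
```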
Datasets
The 2023 DCASE challenge task on Foley sound synthesis, comprising over 6 hours of real audio and 28 hours of generated audio across seven sound classes.
Model(s)
Multi-layer perceptron (MLP) using VGGish, CLAP, and PANN audio embeddings.
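A lightweight MLP head on top of these frozen embeddings could be sketched with scikit-learn; the hidden-layer size and training settings below are assumptions (the paper's exact architecture is not reproduced here), and the random arrays stand in for embeddings extracted as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder for time-averaged embeddings: 200 clips x 512 dimensions.
X = np.random.randn(200, 512)
y = np.random.randint(0, 2, size=200)  # 0 = real, 1 = fake

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Small binary MLP; the single 128-unit hidden layer is an assumption.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print(f"fake-vs-real accuracy: {clf.score(X_test, y_test):.3f}")
```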
Author countries
France, U.S.