FakeSound: Deepfake General Audio Detection

Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

Published: 2024-06-12 10:07:40+00:00

AI Summary

This paper introduces the task of deepfake general audio detection and proposes FakeSound, a dataset generated via an automated manipulation pipeline. It also presents a benchmark detection model that surpasses both human performance and state-of-the-art speech deepfake detection systems.

Abstract

With the advancement of audio generation, generative models can produce highly realistic audio. However, the proliferation of deepfake general audio can have serious negative consequences. We therefore propose a new task, deepfake general audio detection, which aims to identify whether audio content has been manipulated and to locate the deepfake regions. Leveraging an automated manipulation pipeline, we construct FakeSound, a dataset for deepfake general audio detection; samples can be viewed at https://FakeSoundData.github.io. The average binary accuracy of human listeners on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model built on a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that it outperforms both state-of-the-art speech deepfake detection systems and human testers.
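The task pairs clip-level detection (is the audio manipulated?) with region localization (where?), which implies frame-level supervision. Below is a minimal sketch, not taken from the paper, of deriving per-frame binary targets from annotated manipulated regions; the function name and the 20 ms hop size are hypothetical choices for illustration.

```python
import numpy as np

def frame_labels(duration_s, fake_regions, hop_s=0.02):
    """Map annotated deepfake regions (onset, offset) in seconds to
    per-frame binary targets: 1 = manipulated, 0 = genuine.
    hop_s is a hypothetical frame hop, not a value from the paper."""
    n_frames = int(np.ceil(duration_s / hop_s))
    labels = np.zeros(n_frames, dtype=np.int64)
    for onset, offset in fake_regions:
        start = max(0, int(onset / hop_s))
        end = min(n_frames, int(np.ceil(offset / hop_s)))
        labels[start:end] = 1
    return labels

# Example: a 10 s clip whose 3.0-5.5 s segment was regenerated.
labels = frame_labels(10.0, [(3.0, 5.5)])
clip_label = int(labels.any())  # 1 if any region is manipulated
```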


Key findings
The proposed model outperforms both state-of-the-art speech deepfake detection models and human evaluators on the FakeSound dataset. Human binary accuracy is consistently below 0.6, highlighting how difficult manipulated general audio is to discern. Performance drops on zero-shot data from unseen domains, suggesting domain adaptation as a direction for future research.
Approach
The authors propose a deepfake detection model that uses the general audio pre-trained model EAT (Efficient Audio Transformer) as a feature extractor. On top of these features, ResNet blocks, a Transformer encoder, a bidirectional LSTM, and classification layers jointly identify whether a clip is deepfake and locate the manipulated regions; multi-task learning is explored to improve performance. A sketch of this stack appears below.
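The following is a minimal PyTorch sketch of one plausible reading of that architecture, assuming precomputed EAT frame embeddings as input. Layer sizes, the use of a simple Conv1d stand-in for the ResNet blocks, and the head names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FakeSoundDetector(nn.Module):
    """Hypothetical sketch: EAT features -> CNN (ResNet stand-in)
    -> Transformer encoder -> BiLSTM -> two heads
    (clip-level real/fake + frame-level region localization)."""

    def __init__(self, feat_dim=768, hidden=256, n_heads=4, n_layers=2):
        super().__init__()
        # Local pattern extraction over the EAT feature sequence.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
        )
        # Global context modeling.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Sequential modeling for region boundaries.
        self.lstm = nn.LSTM(hidden, hidden // 2,
                            bidirectional=True, batch_first=True)
        # Multi-task heads: clip-level detection and frame-level localization.
        self.clip_head = nn.Linear(hidden, 1)
        self.frame_head = nn.Linear(hidden, 1)

    def forward(self, eat_feats):  # eat_feats: (batch, time, feat_dim)
        x = self.cnn(eat_feats.transpose(1, 2)).transpose(1, 2)
        x = self.transformer(x)
        x, _ = self.lstm(x)
        frame_logits = self.frame_head(x).squeeze(-1)            # (batch, time)
        clip_logits = self.clip_head(x.mean(dim=1)).squeeze(-1)  # (batch,)
        return clip_logits, frame_logits

model = FakeSoundDetector()
feats = torch.randn(2, 100, 768)  # e.g., 100 EAT frame embeddings per clip
clip_logits, frame_logits = model(feats)
```

Training could then combine a clip-level and a frame-level binary cross-entropy loss in a weighted sum, which is one plausible realization of the multi-task setup; the paper's exact losses and weights may differ.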
Datasets
FakeSound dataset (created by the authors), AudioCaps dataset
Model(s)
EAT (Efficient Audio Transformer) pre-trained model, ResNet, Transformer, Bi-directional LSTM, CNN blocks
Author countries
China