Audio Deepfake Detection: A Survey

Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, Yan Zhao

Published: 2023-08-29 01:50:01+00:00

AI Summary

This survey paper provides a systematic overview of audio deepfake detection, analyzing the various deepfake audio types, competitions, datasets, features, classifiers, and evaluation metrics of state-of-the-art approaches. It performs a unified comparison of representative features and classifiers on key datasets. The authors identify critical future research directions, including the need for large-scale in-the-wild datasets, improved generalization to unknown attacks, and better interpretability of detection results.

Abstract

Audio deepfake detection is an emerging, active topic. A growing body of literature has studied deepfake detection algorithms and achieved effective performance, yet the problem is far from solved. Although some review articles exist, there has been no comprehensive survey that provides researchers with a systematic overview of these developments together with a unified evaluation. Accordingly, in this survey paper, we first highlight the key differences across various types of deepfake audio, then outline and analyze the competitions, datasets, features, classifiers, and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments, and major challenges are discussed. In addition, we perform a unified comparison of representative features and classifiers on the ASVspoof 2021, ADD 2023, and In-the-Wild datasets for audio deepfake detection. The survey shows that future research should address the lack of large-scale in-the-wild datasets, the poor generalization of existing detection methods to unknown fake attacks, and the limited interpretability of detection results.


Key findings
Detection systems perform significantly worse in out-of-domain evaluations than in in-domain tests, highlighting poor generalization. Features derived from pre-trained models (e.g., XLS-R) and concatenated features are more robust to out-of-domain data than traditional hand-crafted features. Future research should focus on collecting diverse, large-scale multilingual datasets, improving generalization and robustness to unseen attacks, and enhancing the interpretability of detection results, especially for forensics and attribution.
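Among the classifiers compared in the survey, the GMM is the classical baseline: one mixture model is trained on features from bona fide speech and another on features from spoofed speech, and each utterance is scored by the average frame-level log-likelihood ratio between the two. The sketch below illustrates this scheme with synthetic stand-in features (real systems use acoustic features such as LFCC frames; all variable names here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in features: rows are frames, columns are feature
# dimensions. Real countermeasures would use e.g. LFCC frames instead.
bonafide_feats = rng.normal(0.0, 1.0, size=(500, 20))
spoof_feats = rng.normal(1.5, 1.0, size=(500, 20))

# One GMM per class, as in the classical ASVspoof-style baseline.
gmm_bona = GaussianMixture(n_components=4, random_state=0).fit(bonafide_feats)
gmm_spoof = GaussianMixture(n_components=4, random_state=0).fit(spoof_feats)

def llr_score(utt_frames):
    """Average per-frame log-likelihood ratio; higher = more bona fide."""
    return (gmm_bona.score_samples(utt_frames)
            - gmm_spoof.score_samples(utt_frames)).mean()

test_bona = rng.normal(0.0, 1.0, size=(50, 20))
test_spoof = rng.normal(1.5, 1.0, size=(50, 20))
```

Because the score is a ratio of the two class likelihoods rather than a single likelihood, it is less sensitive to per-utterance variation that affects both models equally.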
Approach
The authors conduct a comprehensive survey, outlining and analyzing the existing literature on audio deepfake detection across fake audio types, competitions, datasets, features, classifiers, and evaluation metrics. They also perform a unified comparison of representative features and classifiers on benchmark datasets to assess current performance and identify open challenges.
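The unified comparisons in this line of work are typically reported in terms of the equal error rate (EER), the operating point at which the false-rejection rate on bona fide audio equals the false-acceptance rate on spoofed audio; it is a primary metric in the ASVspoof and ADD challenges. A minimal sketch of how EER can be computed from detection scores (function and variable names are illustrative, not from the survey):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER: threshold where the false-rejection rate (bona fide scored
    below threshold) equals the false-acceptance rate (spoof scored at
    or above threshold). Higher score = more likely bona fide."""
    thresholds = np.sort(np.unique(np.concatenate([bonafide_scores,
                                                   spoof_scores])))
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))   # closest crossing point
    return (frr[idx] + far[idx]) / 2.0
```

Perfectly separated scores give an EER of 0.0, while fully overlapping score distributions give 0.5, i.e., chance-level detection.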
Datasets
ASVspoof 2021, ADD 2023, In-the-Wild
Model(s)
GMM, LCNN, ResNet, ASSERT, Res2Net, AFN, GRU, GAT, RawNet2, AASIST
Author countries
China