A Survey on Speech Deepfake Detection

Authors: Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang

Published: 2024-04-22 06:52:12+00:00

AI Summary

This survey analyzes more than 200 research papers on speech deepfake detection published up to March 2024. It reviews each component of the detection pipeline, including model architectures, optimization techniques, datasets, and evaluation metrics, identifies the current state of the art, and suggests future research directions.

Abstract

The availability of smart devices leads to an exponential increase in multimedia content. However, advancements in deep learning have also enabled the creation of highly sophisticated Deepfake content, including speech Deepfakes, which pose a serious threat by generating realistic voices and spreading misinformation. To combat this, numerous challenges have been organized to advance speech Deepfake detection techniques. In this survey, we systematically analyze more than 200 papers published up to March 2024. We provide a comprehensive review of each component in the detection pipeline, including model architectures, optimization techniques, generalizability, evaluation metrics, performance comparisons, available datasets, and open source availability. For each aspect, we assess recent progress and discuss ongoing challenges. In addition, we explore emerging topics such as partial Deepfake detection, cross-dataset evaluation, and defences against adversarial attacks, while suggesting promising research directions. This survey not only identifies the current state of the art to establish strong baselines for future experiments but also offers clear guidance for researchers aiming to enhance speech Deepfake detection systems.


Key findings
The survey highlights a shift towards deep embedding representations, especially those from pre-trained self-supervised learning models, which improve performance and cross-dataset generalization. Optimization techniques such as noise addition and codec augmentation are reported to improve robustness. The survey also discusses integrating automatic speaker verification (ASV) with deepfake detection, where simple ensemble methods currently outperform fully integrated systems.
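As an illustration of the noise-addition idea mentioned above, the following is a minimal sketch, not a method taken from the survey. It assumes white Gaussian noise mixed into a waveform at a chosen signal-to-noise ratio; the function names `add_noise_at_snr` and `measured_snr_db`, the 16 kHz sample rate, and the 10 dB target are all hypothetical choices made for the example.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return a copy of `signal` with white Gaussian noise added at `snr_db` dB.

    Illustrative assumption: the noise is white and Gaussian. Real pipelines
    often mix in recorded noise or room impulse responses instead.
    """
    if not signal:
        raise ValueError("signal must not be empty")
    rng = rng or random.Random()
    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR (dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)
```

During training, such a function would typically be applied to each utterance with an SNR drawn at random from a range, so that the detector sees both clean and degraded versions of bona fide and spoofed speech.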
Approach
The paper is a survey and does not propose a novel method. It systematically reviews the existing literature on speech deepfake detection, covering each stage of the detection pipeline, from feature extraction and model architectures to training optimization and evaluation metrics.
Datasets
ASVspoof2015, FoR-original, ASVspoof2019-LA, ASVspoof2021-LA, ASVspoof2021-DF, FMFCC-A, WaveFake, ADD2022-LF, In-the-Wild (ITW), Latin-American Voice Anti-spoofing, TIMIT-TTS, DeepVoice, Multi-Language Audio Anti-spoofing (MLAAD), CodecFake, Codecfake, Chinese Fake Audio Detection (CFAD), PartialSpoof, Half-Truth (HAD), ADD2022-PF, Psynd, ADD2023-PF, VoxCeleb2
Model(s)
Various deep learning models including CNNs (LCNN, ResNet, Res2Net, DenseNet), RNNs (LSTM, GRU, Bi-LSTM), GNNs (GCN, GAT), Transformers, TDNN (ECAPA-TDNN), and other architectures (Siamese networks, Autoencoders). Traditional machine learning classifiers like SVM, GMM, and Random Forest are also mentioned.
Author countries
Canada, China