A Survey on Speech Deepfake Detection

Authors: Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang

Published: 2024-04-22 06:52:12+00:00

Comment: 38 pages. This paper has been accepted by ACM Computing Surveys

AI Summary

This survey provides a comprehensive analysis of over 200 papers on speech deepfake detection published up to March 2024. It systematically reviews each component of the detection pipeline, including model architectures, optimization techniques, datasets, and evaluation metrics. The paper assesses recent progress, discusses ongoing challenges, explores emerging topics like partial deepfake detection and adversarial defenses, and suggests promising future research directions.

Abstract

The availability of smart devices leads to an exponential increase in multimedia content. However, advancements in deep learning have also enabled the creation of highly sophisticated Deepfake content, including speech Deepfakes, which pose a serious threat by generating realistic voices and spreading misinformation. To combat this, numerous challenges have been organized to advance speech Deepfake detection techniques. In this survey, we systematically analyze more than 200 papers published up to March 2024. We provide a comprehensive review of each component in the detection pipeline, including model architectures, optimization techniques, generalizability, evaluation metrics, performance comparisons, available datasets, and open source availability. For each aspect, we assess recent progress and discuss ongoing challenges. In addition, we explore emerging topics such as partial Deepfake detection, cross-dataset evaluation, and defences against adversarial attacks, while suggesting promising research directions. This survey not only identifies the current state of the art to establish strong baselines for future experiments but also offers clear guidance for researchers aiming to enhance speech Deepfake detection systems.


Key findings
The field is moving from hand-crafted features to deep embeddings, particularly those extracted from pre-trained self-supervised learning models. Noise addition and codec augmentation are effective data augmentation techniques, and OC-Softmax and hybrid loss functions significantly improve detection performance. Simple ensemble mechanisms for Spoofing-Aware Speaker Verification (SASV) currently outperform single integrated systems, which suggests that joint optimization strategies need to improve. Open challenges include reproducibility, cross-dataset generalization, interpretability, robustness to multilingual speech and adversarial attacks, and efficient real-time detection of neural codec-based deepfakes.
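Two of the techniques named above, noise addition and OC-Softmax, can be illustrated with a minimal NumPy sketch. This is not taken from the survey: the function names `add_noise_at_snr` and `oc_softmax_loss` are hypothetical, the noise is assumed to be white Gaussian, and the loss follows the commonly cited one-class softmax formulation, with label 0 for bona fide speech, label 1 for spoofed speech, and assumed default margins of 0.9 and 0.2 and scale 20.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise to `signal` at a target SNR in dB (assumed noise type)."""
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def oc_softmax_loss(embeddings, labels, center, m_real=0.9, m_fake=0.2, alpha=20.0):
    """One-class softmax loss (sketch). labels: 0 = bona fide, 1 = spoof.

    Bona fide embeddings are pushed to cosine similarity above `m_real` with the
    single learned `center`; spoofed embeddings are pushed below `m_fake`.
    """
    # Normalize embeddings and the center so the dot product is a cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = center / np.linalg.norm(center)
    cos = x @ w
    margins = np.where(labels == 0, m_real, m_fake)
    signs = np.where(labels == 0, 1.0, -1.0)  # (-1)^label
    # np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, alpha * (margins - cos) * signs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = np.sin(2 * np.pi * 220.0 * t)  # 1 s synthetic tone standing in for speech
    noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)

    center = np.array([1.0, 0.0])
    emb = np.array([[1.0, 0.0], [-1.0, 0.0]])
    print(oc_softmax_loss(emb, np.array([0, 1]), center))  # well separated: small loss
    print(oc_softmax_loss(emb, np.array([1, 0]), center))  # labels swapped: large loss
```

In a real system the center vector would be a trainable parameter updated together with the embedding network, and the sketch does not guard against all-zero embeddings.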
Approach
The paper systematically analyzes over 200 research papers on speech deepfake detection, providing a comprehensive review of the field's advancements. It evaluates various detection pipeline components, including feature extraction methods, classifier architectures, and training optimization techniques. The survey also discusses challenges in areas such as generalizability, interpretability, and robustness, while outlining future research directions.
Datasets
ASVspoof (2015, 2019-LA, 2021-LA, 2021-DF), Fake-or-Real (FoR, original version), FMFCC-A, WaveFake, ADD (2022-LF, 2022-PF, 2023-PF), In-the-Wild (ITW), Latin-American Voice Anti-spoofing, TIMIT-TTS, DeepVoice, Multi-Language Audio Anti-spoofing (MLAAD), CodecFake, CFAD, PartialSpoof, Half-Truth (HAD), Partial Synthetic Detection (Psynd), VoxCeleb2.
Model(s)
Traditional ML classifiers (GMM, RF, SVM), deep learning classifiers (CNN, ResNet, GNN, Transformer, TDNN, DARTS, Bi-LSTM, MLP, RNN, U-Net, Conformer), and end-to-end architectures. Many models use learnable filter banks (e.g., SincNet) and embeddings from pre-trained self-supervised learning models (e.g., wav2vec 2.0, WavLM, HuBERT).
Author countries
Canada, China