Post-training for Deepfake Speech Detection

Authors: Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi

Published: 2025-06-26 08:34:19+00:00

Comment: Corrected the previous implementation of the equal error rate (EER) calculation. Some of the results changed slightly as a result.

AI Summary

This paper introduces a post-training approach for deepfake speech detection that adapts self-supervised learning (SSL) models to bridge the gap between general pre-training and domain-specific fine-tuning. The resulting AntiDeepfake models are post-trained on a large-scale multilingual speech dataset comprising over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts. Without any fine-tuning, they already show strong robustness and generalization to unseen deepfake speech; when further fine-tuned on the Deepfake-Eval-2024 dataset, they consistently surpass existing state-of-the-art detectors that do not use post-training.

Abstract

We introduce a post-training approach that adapts self-supervised learning (SSL) models for deepfake speech detection by bridging the gap between general pre-training and domain-specific fine-tuning. We present AntiDeepfake models, a series of post-trained models developed using a large-scale multilingual speech dataset containing over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts in over one hundred languages. Experimental results show that the post-trained models already exhibit strong robustness and generalization to unseen deepfake speech. When they are further fine-tuned on the Deepfake-Eval-2024 dataset, these models consistently surpass existing state-of-the-art detectors that do not leverage post-training. Model checkpoints and source code are available online.


Key findings
The post-trained AntiDeepfake models show strong robustness and generalization across several unseen deepfake datasets without any fine-tuning (zero-shot). When further fine-tuned on the challenging Deepfake-Eval-2024 dataset, they consistently outperform existing state-of-the-art detectors that do not include a post-training phase. Embedding visualizations indicate that post-training leads the models to learn more discriminative representations, which are also useful for related tasks such as partial spoof detection.
Approach
The authors introduce a post-training phase after general self-supervised pre-training and before task-specific fine-tuning. In this phase, pre-trained SSL models (such as wav2vec 2.0 and HuBERT) are trained on a large, diverse dataset of genuine speech and speech with various artifacts, covering over one hundred languages. The models are optimized with a discriminative cross-entropy loss so that they learn representations that better separate genuine speech from speech with artifacts.
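The cross-entropy objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the SSL front-end is replaced by random toy features, mean pooling over frames and a two-class linear head are assumptions, only the head is updated, and all names (mean_pool, sgd_step, fake_utterance) are hypothetical.

```python
import math
import random

def mean_pool(frames):
    """Average frame-level feature vectors (T x D) into one utterance vector (D)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(weights, bias, x):
    """Linear head: one logit per class (0 = genuine, 1 = artifact)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])

def sgd_step(weights, bias, x, label, lr=0.1):
    """One SGD step on the linear head; returns the loss before the update."""
    probs = softmax(logits_for(weights, bias, x))
    loss = cross_entropy(probs, label)
    for c in range(len(weights)):
        # Gradient of softmax cross-entropy w.r.t. logit c is p_c - 1[c == label].
        grad = probs[c] - (1.0 if c == label else 0.0)
        for d in range(len(x)):
            weights[c][d] -= lr * grad * x[d]
        bias[c] -= lr * grad
    return loss

random.seed(0)
DIM = 8  # toy feature size; real SSL features are far larger

def fake_utterance(label, n_frames=20):
    """Toy stand-in for SSL frame features (assumption, not real speech)."""
    center = 1.0 if label == 0 else -1.0
    return [[random.gauss(center, 1.0) for _ in range(DIM)]
            for _ in range(n_frames)]

data = [(mean_pool(fake_utterance(lbl)), lbl) for lbl in [0, 1] * 16]
weights = [[0.0] * DIM for _ in range(2)]
bias = [0.0, 0.0]

def mean_loss():
    """Average cross-entropy of the current head over the toy data."""
    return sum(cross_entropy(softmax(logits_for(weights, bias, x)), y)
               for x, y in data) / len(data)

initial_loss = mean_loss()
for _ in range(5):
    for x, y in data:
        sgd_step(weights, bias, x, y)
final_loss = mean_loss()
```

In an actual post-training setup the gradient would presumably also flow into the SSL model rather than stopping at the head; the sketch only shows how the two-class cross-entropy loss drives the decision boundary.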
Datasets
Post-training: ASVspoof2019-LA, ASVspoof2021-LA, ASVspoof2021-DF, ASVspoof5, CFAD, DECRO, DFADD, Diffuse or Confuse, DiffSSD, DSD, HABLA, MLAAD, SpoofCeleb, VoiceMOS, CVoiceFake, LibriTTS, LibriTTS-Vocoded, LJSpeech, VoxCeleb2, VoxCeleb2-Vocoded, FLEURS, FLEURS-R, LibriTTS-R, Codecfake, CodecFake, AISHELL3, CNCeleb2, MLS.
Testing: FakeOrReal, FakeOrReal-norm, In-the-Wild, DEEP-VOICE, ADD 2023 (Track-1.2-R2-Test), Deepfake-Eval-2024.
Model(s)
HuBERT (HuBERT-XL), wav2vec 2.0 (W2V-Small, W2V-Large, MMS-300M, MMS-1B, XLS-R-1B, XLS-R-2B). Other models mentioned for comparison include XLSR-Mamba, Resemble AI, SpeechFake, Wav2Vec + VIB, UniSpeech-SAT, XLS-R + SLS, XLSR-Conformer + TCM, AdaLAM & f-InfoED, P3, AASIST, RawNet2.
Author countries
Japan