Post-training for Deepfake Speech Detection


Authors: Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi

Published: 2025-06-26 08:34:19+00:00

AI Summary

This paper introduces a post-training approach that adapts self-supervised learning (SSL) models for deepfake speech detection. The resulting AntiDeepfake models are post-trained on a large multilingual dataset of genuine speech and speech with various artifacts, and they show strong robustness and generalization to unseen deepfakes. After further fine-tuning on Deepfake-Eval-2024, they outperform existing state-of-the-art detectors that do not use post-training.

Abstract

We introduce a post-training approach that adapts self-supervised learning (SSL) models for deepfake speech detection by bridging the gap between general pre-training and domain-specific fine-tuning. We present AntiDeepfake models, a series of post-trained models developed using a large-scale multilingual speech dataset containing over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts in over one hundred languages. Experimental results show that the post-trained models already exhibit strong robustness and generalization to unseen deepfake speech. When they are further fine-tuned on the Deepfake-Eval-2024 dataset, these models consistently surpass existing state-of-the-art detectors that do not leverage post-training. Model checkpoints and source code are available online.

Key findings

Post-trained models showed strong robustness and generalization to unseen deepfake speech even without fine-tuning. Fine-tuning post-trained models on Deepfake-Eval-2024 consistently outperformed state-of-the-art detectors that did not use post-training. Larger models generally performed better, though smaller models excelled on certain datasets.

Approach

The authors propose a post-training phase that sits between SSL pre-training and task-specific fine-tuning. In this phase, pre-trained SSL models (wav2vec 2.0 and HuBERT variants) are trained further on a large-scale multilingual dataset containing genuine speech and speech with various artifacts. A global average pooling layer and a fully connected layer are used for binary classification.
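The pooling-plus-linear classification step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the random frame features stand in for the output of an SSL model, the feature dimension, the weights, the softmax at the end, and the class ordering (index 0 = fake, index 1 = genuine) are all assumptions made for illustration.

```python
import math
import random


def global_average_pool(frames):
    """Average frame-level feature vectors over time into one utterance-level vector."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


def fully_connected(vector, weights, bias):
    """Linear layer: one weight row and one bias term per output class."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax turning logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(frames, weights, bias):
    """Pool over time, apply the linear layer, and return two class probabilities."""
    pooled = global_average_pool(frames)
    logits = fully_connected(pooled, weights, bias)
    return softmax(logits)


# Toy input: 50 frames of 4-dimensional features (hypothetical sizes;
# real SSL features are far larger and come from the SSL model itself).
rng = random.Random(0)
FEATURE_DIM = 4
frames = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(50)]

# Untrained, randomly initialised head with two outputs.
# Assumed ordering: index 0 = fake, index 1 = genuine.
weights = [[rng.uniform(-0.5, 0.5) for _ in range(FEATURE_DIM)] for _ in range(2)]
bias = [0.0, 0.0]

probs = classify(frames, weights, bias)
print("P(fake), P(genuine):", probs)
```

Because pooling averages over the time axis, the head accepts utterances of any length and always produces a fixed-size vector for the linear layer, which is one common reason for pairing these two layers in utterance-level classifiers.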

Datasets

Post-training used a large-scale multilingual speech dataset containing over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts (synthesized, vocoded, restored, and codec speech, among others) in over one hundred languages. The Deepfake-Eval-2024 dataset was used for fine-tuning and evaluation. Other datasets, including ASVspoof, FakeOrReal, In-the-Wild, DEEP-VOICE, ADD 2023, LibriTTS, and VoxCeleb2, were used for training and/or evaluation.

Model(s)

SSL models of various sizes based on wav2vec 2.0 and HuBERT (e.g., HuBERT-XL, W2V-Small, W2V-Large, MMS-300M, MMS-1B, XLS-R-1B, XLS-R-2B).

Author countries

Japan
