Cross-Domain Audio Deepfake Detection: Dataset and Analysis

Authors: Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

Published: 2024-04-07 10:10:15+00:00

AI Summary

This paper introduces a new cross-domain audio deepfake detection (CD-ADD) dataset with over 300 hours of speech generated by five advanced zero-shot TTS models, addressing the obsolescence of existing ADD datasets. Experiments with Wav2Vec2 and Whisper models demonstrate high detection accuracy and strong few-shot adaptation, while also highlighting the challenge posed by neural codec compression.

Abstract

Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.


Key findings
Attack-augmented training significantly improves model robustness: the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5%, respectively. Fine-tuning on as little as one minute of target-domain data enables fast adaptation to unseen zero-shot TTS models. Neural codec compression, however, substantially degrades detection accuracy.
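For reference, the equal error rate (EER) is the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). Below is a minimal NumPy sketch of computing EER from detector scores; it is an illustration of the metric, not code from the paper, and the score distributions in the usage example are synthetic.

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: the threshold where the false rejection rate
    (bona fide rejected) meets the false acceptance rate (spoofs
    accepted). Higher scores are assumed to mean 'more bona fide'."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: bona fide scores falling below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance: spoof scores at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # EER is where the two curves cross; average at the closest point.
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)

# Toy usage: well-separated synthetic score distributions give a low EER.
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 0.5, 1000), rng.normal(-1.0, 0.5, 1000))
print(f"EER: {eer:.3f}")
```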
Approach
The authors fine-tune pre-trained speech encoders (Wav2Vec2 and Whisper) for audio deepfake detection. They employ attack-augmented training to improve robustness and evaluate few-shot learning capabilities using only one minute of target-domain data. Multi-layer features are merged using learnable weights, and a classifier head generates final logits.
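The following PyTorch sketch illustrates the detection head described above: hidden states from all encoder layers are combined with learnable per-layer weights, pooled over time, and passed to a binary classifier. This is an illustration rather than the authors' code; in particular, the softmax normalization of the layer weights, the mean pooling, and the tensor shapes are assumptions, since the paper summary specifies only learnable layer weights and a classifier head.

```python
import torch
import torch.nn as nn

class ADDClassifier(nn.Module):
    """Sketch of the described detection head: merge per-layer encoder
    features with learnable weights, then classify bona fide vs. fake.
    The encoder (e.g., Wav2Vec2 or Whisper) is assumed to return the
    hidden states of all its layers."""

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # One learnable scalar weight per encoder layer (assumed to be
        # softmax-normalized before merging).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_dim, 2)  # bona fide vs. spoof logits

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, hidden_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        merged = (w[:, None, None, None] * hidden_states).sum(dim=0)
        pooled = merged.mean(dim=1)  # average over time -> (batch, hidden_dim)
        return self.head(pooled)

# Toy usage with random features standing in for encoder outputs
# (13 hidden states of dimension 768, as in Wav2Vec2-base).
feats = torch.randn(13, 4, 200, 768)
logits = ADDClassifier(num_layers=13, hidden_dim=768)(feats)
print(logits.shape)  # torch.Size([4, 2])
```

During fine-tuning, the layer weights let the model learn which encoder depths carry the most artifact information, rather than committing to a single layer up front.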
Datasets
A new cross-domain ADD dataset (CD-ADD) with over 300 hours of speech generated by five zero-shot TTS models (VALL-E, YourTTS, WhisperSpeech, Seamless Expressive, OpenVoice), plus LibriTTS (train-clean-100, dev-clean, test-clean subsets), TED-LIUM 3 (test set), and ASVspoof 2019.
Model(s)
Wav2Vec2 (base, large), Whisper (medium)
Author countries
China