I Can Hear You: Selective Robust Training for Deepfake Audio Detection

Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao

Published: 2024-10-31 18:21:36+00:00

AI Summary

This paper introduces DeepFakeVox-HQ, the largest public deepfake audio dataset, and proposes F-SAT, a Frequency-Selective Adversarial Training method to improve deepfake audio detection robustness. F-SAT focuses on high-frequency components, which are easily manipulated by attackers, improving accuracy on both clean and corrupted/attacked samples.

Abstract

Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.


Key findings
Training on DeepFakeVox-HQ boosted baseline model performance by 33%. F-SAT further improved accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples compared to the state-of-the-art RawNet3 model. The authors also identified that the best detection models rely heavily on high-frequency features, which are imperceptible to humans and easily manipulated by attackers; F-SAT mitigates this vulnerability.
Approach
The authors address the vulnerability of existing deepfake audio detectors to high-frequency features by proposing F-SAT. F-SAT performs adversarial training focusing on high-frequency components of audio waveforms, enhancing robustness against attacks and corruptions while maintaining accuracy on clean data. They also incorporate random audio augmentations to improve model robustness.
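The central mechanism in F-SAT is restricting the adversarial perturbation to high-frequency bands before applying it during training. A minimal sketch of that frequency-selective projection step is below, using a simple FFT mask; the function name `highpass_project` and the 4 kHz cutoff are illustrative assumptions, not values from the paper.

```python
import numpy as np

def highpass_project(delta, sample_rate, cutoff_hz):
    """Keep only the high-frequency components of a perturbation.

    Sketch of the frequency-selective idea in F-SAT: the adversarial
    update is confined to bands above `cutoff_hz`, so robust training
    targets the high frequencies that detectors tend to over-rely on.
    (`cutoff_hz` here is a hypothetical parameter for illustration.)
    """
    spectrum = np.fft.rfft(delta)
    freqs = np.fft.rfftfreq(len(delta), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0.0  # zero out low-frequency bins
    return np.fft.irfft(spectrum, n=len(delta))

# Toy perturbation: a 100 Hz plus a 6 kHz sine at 16 kHz sampling.
sr = 16000
t = np.arange(sr) / sr
delta = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
hf = highpass_project(delta, sr, cutoff_hz=4000)  # only the 6 kHz tone survives
```

In a full adversarial training loop, this projection would be applied to the gradient-based perturbation at each attack step, so the inner maximization only searches over high-frequency directions.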
Datasets
DeepFakeVox-HQ (1.3 million samples, including 270,000 high-quality deepfakes from 14 sources), ASVspoof2019, ASVspoof2021, WaveFake, VCTK, LibriSpeech, AudioSet, VoxCeleb1, YouTube, X (Twitter).
Model(s)
RawNet3, RawNet2, RawGAT-ST, TE-ResNet
Author countries
USA