Generalizable Detection of Audio Deepfakes

Authors: Jose A. Lopez, Georg Stemmer, Héctor Cordourier Maruri

Published: 2025-07-02 14:28:11+00:00

Comment: 8 pages, 3 figures

AI Summary

This paper presents a comprehensive study to enhance the generalization capabilities of audio deepfake detection models. The authors investigate various pre-trained backbones (Wav2Vec2, WavLM, Whisper), different data augmentation strategies, and novel loss functions across a diverse set of datasets. Their research demonstrates significant improvements in generalization, surpassing the performance of the top-ranked single system in the ASVspoof 5 Challenge.

Abstract

In this paper, we present our comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre-trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets, including those from the ASVspoof challenges and additional sources. Our experiments focus on the effects of different data augmentation strategies and loss functions on model performance. The results of our research demonstrate substantial enhancements in the generalization capabilities of audio deepfake detection models, surpassing the performance of the top-ranked single system in the ASVspoof 5 Challenge. This study contributes valuable insights into the optimization of audio models for more robust deepfake detection and facilitates future research in this critical area.


Key findings
The study found that Wav2Vec2 models generally outperformed the other backbones, WavLM and Whisper. Novel loss functions (focal loss and a hinged-center loss), together with data augmentation strategies such as RawBoost, additive white Gaussian noise (AWGN), and room impulse response (RIR) augmentation, significantly improved generalization. The resulting approach surpassed the top-ranked single system in the ASVspoof 5 Challenge and generalized well across the test sets. The equal error rate (EER) was lower on longer and cleaner audio.
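EER, the metric these findings are reported in, is the operating point at which the rate of spoofed audio being accepted equals the rate of bona fide audio being rejected. The following is a minimal sketch of how it can be computed, not the authors' evaluation code. It assumes that a higher score means "more likely bona fide" and that labels are 1 for bona fide and 0 for spoof; the function name `equal_error_rate` is illustrative.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a list of detector scores.

    Assumed convention (not taken from the paper):
    labels are 1 for bona fide and 0 for spoof, and a trial is
    accepted as bona fide when its score is >= the threshold.
    """
    bona_fide = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona_fide or not spoof:
        raise ValueError("need at least one bona fide and one spoof trial")

    best = None  # (gap between the two error rates, eer estimate, threshold)
    for threshold in sorted(set(scores)):
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= threshold for s in spoof) / len(spoof)
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < threshold for s in bona_fide) / len(bona_fide)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finitely many trials the two rates rarely match exactly,
            # so report their average at the closest crossing point.
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 1, 0, 1, 0, 0, 0]
    eer, threshold = equal_error_rate(scores, labels)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

This exhaustive threshold sweep is quadratic in the number of trials, which is fine for illustration. Large evaluation sets are usually handled with a sorted, cumulative-count version of the same idea.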
Approach
The authors aim to improve generalization by combining pre-trained backbones (Wav2Vec2, WavLM, Whisper) with a simple classifier head and by investigating novel loss functions, namely focal loss and a hinged-center loss. They also evaluate a range of data augmentation strategies, including additive white Gaussian noise (AWGN), RawBoost, vocoded audio, and room impulse response (RIR) augmentation, to improve robustness across diverse deepfake scenarios.
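Two of the ingredients named above have standard textbook forms that can be sketched briefly. The code below shows the usual binary focal loss, which down-weights examples the model already classifies confidently, and AWGN augmentation at a target signal-to-noise ratio. It is an illustration under stated assumptions, not the authors' implementation: the values `gamma=2.0`, `alpha=0.25`, and the 10 dB SNR are placeholders rather than settings from the paper, and the function names are hypothetical. The hinged-center loss is not sketched because its definition is not given here.

```python
import math
import random


def focal_loss(p_bona_fide, label, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one example.

    p_bona_fide: predicted probability of the bona fide class.
    label: 1 for bona fide, 0 for spoof (assumed convention).
    gamma, alpha: placeholder hyperparameters, not values from the paper.
    """
    # Probability and class weight assigned to the true class.
    p_t = p_bona_fide if label == 1 else 1.0 - p_bona_fide
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def add_awgn(waveform, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB."""
    if not waveform:
        raise ValueError("waveform must not be empty")
    rng = rng or random.Random()
    signal_power = sum(x * x for x in waveform) / len(waveform)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in waveform]


if __name__ == "__main__":
    # A confident correct prediction contributes less loss than an unsure one.
    print(focal_loss(0.9, 1), focal_loss(0.6, 1))

    # One second of a 440 Hz tone at 16 kHz, corrupted at 10 dB SNR.
    clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
    noisy = add_awgn(clean, snr_db=10.0, rng=random.Random(0))
    print(len(noisy))
```

In a training pipeline the noise level would typically be drawn at random per utterance so the model sees a range of conditions. That sampling scheme is a design choice left open here.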
Datasets
ASVspoof 2019 LA, ASVspoof 5, proprietary collection (Speecon US), ASVspoof 2015, ASVspoof 2021 (LA progress, LA eval, LA hidden, DF progress, DF eval, DF hidden), In-The-Wild (ITW), M-AILABS, MLAAD v4, DeepFake Detection Challenge (DFDC) (audio-only), FakeAVCeleb, LJ Speech, FB ASR Fairness dataset.
Model(s)
Wav2Vec2 (XLS-R 300M, XLS-R 1B, XLSR-53), WavLM (Large), and Whisper (Medium), each with a fully connected classifier head.
Author countries
USA, Germany, Mexico