AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit

Authors: Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk

Published: 2025-09-25 21:09:40+00:00

AI Summary

This paper introduces AUDDT, an open-source toolkit for benchmarking audio deepfake detection models across 28 diverse datasets. It aims to automate the evaluation process, providing insights into the generalization capabilities and shortcomings of pretrained detectors. The toolkit also highlights limitations of current datasets and their gap relative to real-world deployment.

Abstract

With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.


Key findings
Benchmarking a baseline detector (W2V-AASIST) revealed significant performance disparities: accuracy was high on conventional deepfakes but degraded severely on higher-quality diffusion-based deepfakes, unseen languages, and in-the-wild speech. The detector was also notably sensitive to harmless neural artifacts introduced by vocoders, neural codecs, and speech enhancement, often misclassifying speech processed by these systems as deepfake. Existing datasets predominantly feature studio-quality, scripted speech and lack diversity in real-world perturbations, languages, accents, and human expressivity.
Approach
AUDDT provides an automated, end-to-end benchmarking pipeline that evaluates any pretrained deepfake detector on 28 diverse audio deepfake datasets. It standardizes data downloading and label preparation, then runs inference and computes metrics such as EER, accuracy, TPR, TNR, and AUC-ROC for each deepfake category.
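To make the listed metrics concrete, the following is a minimal sketch of how EER, accuracy, TPR, TNR, and AUC-ROC can be computed from detector scores, overall and per category. It is not AUDDT's own code: the function names `compute_metrics` and `metrics_by_category`, the 0.5 decision threshold, the category names, and the convention that label 1 means deepfake with higher scores meaning "more likely deepfake" are all assumptions made for illustration.

```python
import numpy as np


def compute_metrics(labels, scores, threshold=0.5):
    """Illustrative metrics for a binary detector.

    Assumed convention (not taken from AUDDT): label 1 = deepfake (positive),
    label 0 = bona fide, and a higher score means "more likely deepfake".
    """
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes are needed to compute these metrics")

    # Metrics at a fixed decision threshold.
    preds = scores >= threshold
    tpr = float(np.mean(preds[labels == 1]))    # deepfakes correctly flagged
    tnr = float(np.mean(~preds[labels == 0]))   # bona fide correctly passed
    acc = float(np.mean(preds == (labels == 1)))

    # EER: sweep every observed score as a threshold and take the point
    # where the false positive and false negative rates are closest.
    candidates = np.concatenate([np.unique(scores), [np.inf]])
    fpr = np.array([np.mean(neg >= t) for t in candidates])
    fnr = np.array([np.mean(pos < t) for t in candidates])
    i = int(np.argmin(np.abs(fpr - fnr)))
    eer = float((fpr[i] + fnr[i]) / 2)

    # AUC-ROC as the probability that a random deepfake outscores a random
    # bona fide sample (ties count half). The pairwise matrix is quadratic
    # in memory, so this form only suits small evaluation sets.
    diff = pos[:, None] - neg[None, :]
    auc = float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

    return {"eer": eer, "accuracy": acc, "tpr": tpr, "tnr": tnr, "auc_roc": auc}


def metrics_by_category(labels, scores, categories, threshold=0.5):
    """Apply compute_metrics separately to each (hypothetical) category."""
    labels, scores, categories = map(np.asarray, (labels, scores, categories))
    return {
        str(c): compute_metrics(labels[categories == c],
                                scores[categories == c], threshold)
        for c in np.unique(categories)
    }


if __name__ == "__main__":
    y = [1, 1, 1, 0, 0, 0]
    s = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    groups = ["tts", "tts", "vc", "tts", "vc", "vc"]  # made-up category names
    print(compute_metrics(y, s))
    print(metrics_by_category(y, s, groups))
```

Note that EER and AUC-ROC are threshold-free, while accuracy, TPR, and TNR depend on the chosen operating point, so a per-category breakdown can show a detector ranking samples well yet still misclassifying them at a fixed threshold.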
Datasets
ASVspoof5, ASVspoof2019 LA, ASVspoof2021 DF, ASVspoof2021 LA, CodecFake, Codecfake, CtrSVDD, CVoiceFake, DECRO, DFADD, DiffSSD, DiffuseOrConfuse, EnhanceSpeech, FoR-original, FoR-2seconds, FoR-norm, FoR-rerecorded, HABLA, In-the-wild, MLAAD-v5, MSceneSpeech, ODSS, Playback attacks, SpoofCeleb, JVNV, SRC4VC, TIMIT-TTS, WaveFake (total of 28 datasets).
Model(s)
W2V-AASIST (Wav2vec2-XLSR-300M as frontend and AASIST classifier as backend).
Author countries
Canada