AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit

Authors: Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk

Published: 2025-09-25 21:09:40+00:00

AI Summary

The paper introduces AUDDT, an open-source Audio Unified Deepfake Detection Benchmark Toolkit that pairs a systematic review of 28 diverse audio deepfake datasets with standardized evaluation across all of them. The toolkit automates the evaluation of pretrained detectors, assessing how well they generalize across different deepfake generation methods and acoustic conditions. Using a widely adopted baseline model, the authors demonstrate significant performance disparities and highlight critical weaknesses in generalization to out-of-domain data.

Abstract

With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.


Key findings
The W2V-AASIST baseline achieved high accuracy on conventional datasets, but accuracy dropped sharply (to as low as 26%) on high-quality diffusion-based deepfakes, on languages genetically unrelated to the training data (e.g., Japanese), and on in-the-wild speech. The detector was also highly sensitive to neural artifacts introduced by vocoders and speech enhancement, misclassifying up to 82% of enhanced real speech as fake, which suggests it relies on such processing artifacts rather than deepfake-specific cues.
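To make the 82% figure concrete: it is a false-alarm rate on bona fide speech, computed per processing subgroup. A minimal sketch of that calculation follows, assuming detector scores where higher means more likely real; the scores and subgroup names below are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def false_alarm_rate(scores: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of bona fide utterances scored below the real-vs-fake
    threshold, i.e. real speech misclassified as fake."""
    return float((scores < threshold).mean())

# Hypothetical detector scores (higher = more likely real) for bona fide
# speech, grouped by how the audio was processed:
subgroups = {
    "clean":    np.array([0.90, 0.80, 0.95, 0.70]),
    "enhanced": np.array([0.20, 0.40, 0.10, 0.60]),  # speech-enhancement artifacts
    "vocoded":  np.array([0.30, 0.45, 0.50, 0.20]),  # neural vocoder artifacts
}
for name, scores in subgroups.items():
    print(f"{name:>8}: {false_alarm_rate(scores):.0%} of real speech flagged as fake")
```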
Approach
AUDDT provides an automated benchmarking framework that links any pretrained deepfake detector to 28 different audio deepfake datasets, standardizing data downloading, label preparation, inference, and metric calculation. The framework categorizes datasets based on attributes like language, perturbation type, and generative method to allow granular performance analysis. A baseline detector (W2V-AASIST) is benchmarked to showcase the toolkit's utility and expose current generalization gaps.
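AUDDT's actual interfaces are defined in the linked repository; purely as a sketch of the standardized loop described above (with downloading and label preparation assumed done upstream), the evaluation core might look like the following, where `Detector`, `LabeledSet`, and the dummy inputs are hypothetical placeholders rather than AUDDT's real API.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Detector = Callable[[Path], float]   # wav path -> score, higher = more likely real
LabeledSet = List[Tuple[Path, int]]  # (wav path, 1 = bona fide, 0 = deepfake)

def benchmark(detector: Detector,
              datasets: Dict[str, LabeledSet],
              threshold: float = 0.5) -> Dict[str, float]:
    """Run one pretrained detector over every dataset and report per-set accuracy."""
    results: Dict[str, float] = {}
    for name, pairs in datasets.items():
        correct = sum((detector(wav) >= threshold) == bool(label)
                      for wav, label in pairs)
        results[name] = correct / len(pairs)
    return results

# Hypothetical usage with a constant dummy detector that always predicts "real":
dummy_sets = {"toy": [(Path("a.wav"), 1), (Path("b.wav"), 0)]}
print(benchmark(lambda wav: 0.9, dummy_sets))  # {'toy': 0.5}
```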
Datasets
28 existing audio deepfake datasets, including the ASVspoof series (2019, 2021, and ASVspoof 5), DiffSSD, DiffuseOrConfuse, DFADD, CodecFake, MLAAD-v5, and EnhanceSpeech.
Model(s)
W2V-AASIST (Wav2vec2-XLSR-300M frontend + AASIST classifier backend)
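For orientation, here is a minimal sketch of this two-stage design, using the real Hugging Face checkpoint facebook/wav2vec2-xls-r-300m as the front-end but substituting a mean-pooled linear head for the actual AASIST graph-attention back-end; this is an architectural stand-in, not the paper's model.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2VDetector(nn.Module):
    """Self-supervised front-end + classifier back-end, in the spirit of W2V-AASIST.

    NOTE: the real back-end is AASIST (a spectro-temporal graph attention
    network); a mean-pooled linear head is used here purely as a stand-in.
    """
    def __init__(self):
        super().__init__()
        self.frontend = Wav2Vec2Model.from_pretrained(
            "facebook/wav2vec2-xls-r-300m")  # 300M-parameter XLS-R encoder
        self.head = nn.Linear(self.frontend.config.hidden_size, 2)  # real vs fake

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden = self.frontend(waveform).last_hidden_state  # (batch, frames, dim)
        return self.head(hidden.mean(dim=1))                # (batch, 2) logits

model = W2VDetector().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 16000))  # one second of dummy audio
print(logits.shape)  # torch.Size([1, 2])
```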
Author countries
Canada