AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds

Authors: Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie

Published: 2025-09-04 16:03:44+00:00

AI Summary

The paper introduces AUDETER, a large-scale dataset (3 million audio clips, 4,500+ hours) for deepfake audio detection, addressing the limitations of existing datasets in covering diverse and up-to-date audio samples. Experiments show that state-of-the-art detectors trained on AUDETER significantly outperform the same methods trained on existing datasets in cross-domain evaluation, reducing detection error rates by 44.1% to 51.6%.

Abstract

Speech generation systems can produce remarkably realistic vocalisations that are often indistinguishable from human speech, posing significant authenticity challenges. Although numerous deepfake detection methods have been developed, their effectiveness in real-world environments remains unreliable due to the domain shift between training and test samples arising from diverse human speech and fast-evolving speech synthesis systems. This is not adequately addressed by current datasets, which lack real-world application challenges with diverse and up-to-date audio in both real and deepfake categories. To fill this gap, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale, highly diverse deepfake audio dataset for comprehensive evaluation and robust development of generalised models for deepfake audio detection. It consists of over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders with a broad range of TTS/vocoder patterns, totalling 3 million audio clips, making it the largest deepfake audio dataset by scale. Through extensive experiments with AUDETER, we reveal that i) state-of-the-art (SOTA) methods trained on existing datasets struggle to generalise to novel deepfake audio samples and suffer from high false positive rates on unseen human voices, underscoring the need for a comprehensive dataset; and ii) these methods trained on AUDETER achieve highly generalised detection performance and significantly reduce the detection error rate by 44.1% to 51.6%, achieving an error rate of only 4.17% on diverse cross-domain samples in the popular In-the-Wild dataset, paving the way for training generalist deepfake audio detectors. AUDETER is available on GitHub.


Key findings
State-of-the-art deepfake detection methods struggle to generalize to new deepfake audio samples and exhibit high false positive rates. Models trained on AUDETER achieve significantly improved generalization performance, reducing error rates by up to 51.6% on the In-the-Wild dataset. The diversity of real and synthetic audio in AUDETER is crucial for improving model performance.
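To make the reported figures concrete, the arithmetic below relates the relative error-rate reduction to the absolute error rates it connects. The 4.17% error rate and 51.6% reduction come from the abstract; the implied baseline error rate is an inference from those two numbers, not a figure reported in the paper.

```python
# Relating a relative error-rate reduction to absolute error rates.
# Reported: 4.17% error on In-the-Wild after training on AUDETER,
# a 51.6% relative reduction over training on existing datasets.
audeter_error = 4.17          # % (reported)
relative_reduction = 0.516    # 51.6% (reported)

# If error_new = error_old * (1 - reduction), then the implied baseline is:
implied_baseline = audeter_error / (1 - relative_reduction)
print(f"implied baseline error rate: {implied_baseline:.2f}%")
```

Under this reading, the strongest baseline would sit at roughly 8.6% error on In-the-Wild, which the AUDETER-trained models roughly halve.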
Approach
AUDETER addresses the problem of deepfake audio detection by creating a large-scale, diverse dataset encompassing various recent TTS models and vocoders, along with diverse real audio sources. State-of-the-art deepfake detection models are then trained on this dataset to improve their generalization capabilities.
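Deepfake-audio detectors in this line of work are typically scored by the equal error rate (EER), the operating point where the false-positive rate on real speech equals the false-negative rate on synthetic speech. The paper does not include this code; the sketch below is a minimal, self-contained EER computation, assuming scores where higher means "more likely deepfake" and labels where 1 marks synthetic audio.

```python
def compute_eer(scores, labels):
    """Equal Error Rate: the point where the false-positive rate (real audio
    flagged as fake) and false-negative rate (fakes missed) are closest.
    scores: higher = more likely deepfake; labels: 1 = deepfake, 0 = real."""
    pairs = sorted(zip(scores, labels), reverse=True)  # sweep threshold downward
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    best_gap, eer = float("inf"), 1.0
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        fnr = 1 - tp / n_pos   # fakes still missed at this threshold
        fpr = fp / n_neg       # real clips wrongly flagged
        if abs(fnr - fpr) < best_gap:
            best_gap, eer = abs(fnr - fpr), (fnr + fpr) / 2
    return eer

# A perfectly separating detector has EER 0.0:
print(compute_eer([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # -> 0.0
```

The cross-domain evaluation in the paper amounts to computing this metric on test sets (e.g. In-the-Wild) whose speakers and synthesis systems are disjoint from the training data.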
Datasets
AUDETER (created by the authors), ASVSpoof 2019, ASVSpoof 2021, In-the-Wild, WaveFake, LibriSeVoc, Common Voice, People's Speech, Multilingual LibriSpeech
Model(s)
RawNet2, RawGAT-ST, AASIST, PC-DARTS, SAMO, Neural Vocoder Artifacts (NVA), Purdue M2, XLS-R + RawNet + AASIST (XLS-R+A), XLS-R + SLS
Author countries
Australia, Singapore