Unmasking real-world audio deepfakes: A data-centric approach

Authors: David Combei, Adriana Stan, Dan Oneata, Nicolas Müller, Horia Cucu

Published: 2025-06-11 11:03:26+00:00

AI Summary

This paper introduces a new dataset, AI4T, of real-world audio deepfakes collected from online platforms. Instead of focusing on model complexity, it employs data-centric approaches (curation, pruning, augmentation) to significantly improve deepfake detection performance on both AI4T and the In-the-Wild dataset, achieving substantial reductions in Equal Error Rate (EER).

Abstract

The growing prevalence of real-world deepfakes presents a critical challenge for existing detection systems, which are often evaluated on datasets collected just for scientific purposes. To address this gap, we introduce a novel dataset of real-world audio deepfakes. Our analysis reveals that these real-world examples pose significant challenges, even for the most performant detection models. Rather than increasing model complexity or exhaustively searching for a better alternative, in this work we focus on a data-centric paradigm, employing strategies like dataset curation, pruning, and augmentation to improve model robustness and generalization. Through these methods, we achieve a 55% relative reduction in EER on the In-the-Wild dataset, reaching an absolute EER of 1.7%, and a 63% reduction on our newly proposed real-world deepfakes dataset, AI4T. These results highlight the transformative potential of data-centric approaches in enhancing deepfake detection for real-world applications. Code and data available at: https://github.com/davidcombei/AI4T.


Key findings
Data-centric methods substantially improved deepfake detection performance: a 55% relative reduction in EER on the In-the-Wild dataset (reaching 1.7% EER) and a 63% reduction on the AI4T dataset. The results suggest that, for real-world deepfake detection, data quality and data-centric methods deserve at least as much attention as model complexity.
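For readers unfamiliar with the metric, the EER is the operating point at which the false positive rate equals the false negative rate. The following is a minimal sketch in plain Python, assuming that higher scores mean "more likely fake" and that label 1 marks a fake sample. The function names are illustrative and are not taken from the paper's repository.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    scores: detector outputs, higher = more likely fake (assumption).
    labels: 1 for fake, 0 for real (assumption).
    """
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one real and one fake sample")

    best_gap, eer = None, None
    # Include +inf so that "flag nothing as fake" is also a candidate threshold.
    for t in sorted(set(scores)) + [float("inf")]:
        # False positive: real audio scored at or above the threshold.
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / n_real
        # False negative: fake audio scored below the threshold.
        fnr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t) / n_fake
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


def relative_reduction(baseline_eer, improved_eer):
    """Fraction by which the EER dropped relative to the baseline."""
    return (baseline_eer - improved_eer) / baseline_eer


def implied_baseline(final_eer, rel_reduction):
    """Baseline EER implied by a final EER and a relative reduction."""
    return final_eer / (1.0 - rel_reduction)
```

Under the assumption that the 55% figure is measured against the baseline EER, `implied_baseline(1.7, 0.55)` gives roughly 3.8, so the reported In-the-Wild numbers would correspond to a baseline EER of about 3.8%.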
Approach
The paper uses a data-centric approach to improve audio deepfake detection. This involves dataset curation, pruning (removing less informative samples using various strategies), and augmentation to enhance model robustness and generalization without changing the model architecture. A self-supervised learning (SSL) based model is used as the baseline detector.
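To make the pruning idea concrete, here is one possible criterion sketched in plain Python: drop the samples that a preliminary detector already classifies with high confidence, and keep the ones it is least sure about. This is an illustrative assumption only. The summary above does not specify which pruning strategies the paper uses, and the function name and `keep_fraction` parameter are hypothetical.

```python
import math


def prune_easy_samples(probs_fake, labels, keep_fraction=0.5):
    """Return sorted indices of the samples kept after pruning.

    probs_fake: a preliminary detector's P(fake) per sample (assumption).
    labels: 1 for fake, 0 for real (assumption).
    keep_fraction: share of the training set to keep (hypothetical knob).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Probability assigned to the correct label; high means an "easy" sample.
    confidence = [p if y == 1 else 1.0 - p for p, y in zip(probs_fake, labels)]

    # Rank samples from least to most confident and keep the hardest ones.
    order = sorted(range(len(labels)), key=lambda i: confidence[i])
    n_keep = max(1, math.ceil(keep_fraction * len(labels)))
    return sorted(order[:n_keep])


# Toy usage: samples 0 and 2 are classified confidently and get pruned.
kept = prune_easy_samples([0.99, 0.6, 0.05, 0.45], [1, 1, 0, 0], keep_fraction=0.5)
```

Because the detector architecture stays fixed, a step like this only changes which samples reach training, which is the sense in which the approach is data-centric.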
Datasets
AI4T (newly proposed real-world audio deepfakes dataset), In-the-Wild (ITW) dataset, ASV19, FoR, ASV21, TIMIT, ODSS, MLAAD v5, ASV5, M-AILABS
Model(s)
wav2vec2 XLS-R 2B (features extracted), logistic regression classifier
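The model setup above (frozen SSL features feeding a logistic regression classifier) can be sketched without any deep learning dependency. The example below is a rough illustration, not the authors' code: it trains a logistic regression by gradient descent on synthetic 4-dimensional vectors that stand in for pooled XLS-R features. The feature dimension, learning rate, and epoch count are all assumptions chosen to keep the example small.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """P(fake) for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean log loss."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic stand-ins for frozen SSL features (NOT real XLS-R outputs).
rng = random.Random(0)
real = [[rng.gauss(-1.0, 0.5) for _ in range(4)] for _ in range(50)]
fake = [[rng.gauss(1.0, 0.5) for _ in range(4)] for _ in range(50)]
features, labels = real + fake, [0] * 50 + [1] * 50

weights, bias = train_logistic_regression(features, labels)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

In a real pipeline the synthetic vectors would be replaced by features extracted once from the frozen wav2vec2 XLS-R model, so only the small linear classifier is retrained when the training data is curated, pruned, or augmented.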
Author countries
Romania, Germany