A Data-Centric Approach to Generalizable Speech Deepfake Detection

Authors: Wen Huang, Yuchen Mao, Yanmin Qian

Published: 2025-12-20 04:28:33+00:00

AI Summary

This paper introduces a data-centric approach to generalizable speech deepfake detection (SDD), emphasizing the critical role of data composition over model-centric solutions. It characterizes data scaling laws for SDD, quantifying the impact of source and generator diversity, and proposes the Diversity-Optimized Sampling Strategy (DOSS) for mixing heterogeneous data. The DOSS framework achieves state-of-the-art generalization performance with superior data and model efficiency on public benchmarks and a new challenge set of commercial APIs.

Abstract

Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition remains underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.


Key findings
The study finds that source and generator diversity are the primary drivers of generalization in SDD, follow predictable power laws, and matter more than raw data volume. The Diversity-Optimized Sampling Strategy (DOSS), particularly DOSS-Weight, significantly outperforms naive data aggregation. The final DOSS-trained model (XLS-R-1B, 12k hours) achieves state-of-the-art performance, outperforming larger baselines (XLS-R-2B, 74k hours) with greater data and model efficiency on both public benchmarks and a challenge set of commercial APIs.
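The power-law claim can be made concrete with a small curve fit. The sketch below is illustrative only: the saturating form EER(n) = a * n^(-b) + c and the sample numbers are assumptions chosen for demonstration, not the paper's data or code.

```python
# Illustrative sketch: fitting a saturating power law to hypothetical
# (diversity, error) measurements. The functional form and the numbers
# below are assumptions, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Saturating power law: error decays as diversity n grows, floor c."""
    return a * np.power(n, -b) + c

# Hypothetical (number of distinct generators, EER%) pairs
n_generators = np.array([1, 2, 4, 8, 16, 32], dtype=float)
eer_percent = np.array([18.0, 12.5, 8.9, 6.4, 4.8, 3.9])

params, _ = curve_fit(power_law, n_generators, eer_percent, p0=[20.0, 0.5, 2.0])
a, b, c = params
print(f"EER(n) = {a:.2f} * n^(-{b:.2f}) + {c:.2f}")
# Extrapolate the expected benefit of doubling generator diversity
print(f"Predicted EER at n=64: {power_law(64.0, *params):.2f}%")
```

Once fitted, the same form can extrapolate the expected return on adding further sources or generators, which is how scaling-law studies typically inform data-collection budgets.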
Approach
The authors identify data diversity as a primary driver of generalizable SDD and propose the Diversity-Optimized Sampling Strategy (DOSS) to manage heterogeneous data mixtures. DOSS has two implementations, DOSS-Select (data pruning) and DOSS-Weight (re-weighting), both of which approximate a uniform distribution across domains to maximize diversity. By balancing the contribution of the various speech deepfake sources and generators, the strategy trains models efficiently.
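A minimal sketch of the two variants as described, assuming each training sample carries a domain label (e.g., its source dataset or generator). The dictionary-based sample representation and function names are illustrative, not the authors' implementation.

```python
# Sketch of DOSS-Select (prune to equal-size domains) and DOSS-Weight
# (re-weight samples toward a uniform-over-domains target).
# Data layout is an assumption: each sample is a dict with a "domain" key.
import random
from collections import defaultdict

def doss_select(samples, cap=None):
    """DOSS-Select: keep an equal-size random subset from each domain."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s)
    k = cap or min(len(v) for v in by_domain.values())
    return [s for v in by_domain.values() for s in random.sample(v, min(k, len(v)))]

def doss_weight(samples):
    """DOSS-Weight: per-sample weights so each domain is drawn uniformly."""
    counts = defaultdict(int)
    for s in samples:
        counts[s["domain"]] += 1
    d = len(counts)
    # target P(domain) = 1/d, uniform within a domain -> 1 / (d * domain size)
    return [1.0 / (d * counts[s["domain"]]) for s in samples]
```

In a PyTorch pipeline, the weights from doss_weight could be passed to torch.utils.data.WeightedRandomSampler so that minibatches approximate the uniform-over-domains target, while doss_select simply yields a smaller, balanced training list.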
Datasets
For training and validation: A curated 12k-hour data pool comprising 17 public datasets (e.g., ASVspoof2019, ASVspoof2021, ADD2022, ADD2023, SpeechFake) and self-generated data from 7 recent TTS/VC models. For evaluation: 10 public benchmarks (e.g., ASVspoof2019, DECRO, InTheWild, SpoofCeleb, EmoFake, ODSS) and a new challenge set of 9 commercial TTS APIs (e.g., Google Cloud TTS, ElevenLabs TTS, OpenAI GPT-4o mini TTS, Qwen3 TTS Flash).
Model(s)
XLS-R (300M-parameter and 1B-parameter versions) as the self-supervised backbone, adapted with a temporal average pooling layer and an MLP classifier head.
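A minimal sketch of the described detector, assuming the public HuggingFace checkpoint facebook/wav2vec2-xls-r-300m and an arbitrary MLP width; the exact head dimensions and training details are assumptions, not the paper's configuration.

```python
# Sketch: XLS-R backbone + temporal average pooling + MLP classifier head.
# Checkpoint name and mlp_dim are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SDDClassifier(nn.Module):
    def __init__(self, backbone="facebook/wav2vec2-xls-r-300m", mlp_dim=256):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size  # 1024 for the 300M model
        self.head = nn.Sequential(
            nn.Linear(hidden, mlp_dim),
            nn.ReLU(),
            nn.Linear(mlp_dim, 2),  # bonafide vs. spoof logits
        )

    def forward(self, waveform):
        # waveform: (batch, time) raw 16 kHz audio
        feats = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = feats.mean(dim=1)                        # temporal average pooling
        return self.head(pooled)                          # (batch, 2)

model = SDDClassifier()
logits = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
```

The 1B variant would swap in the facebook/wav2vec2-xls-r-1b checkpoint; the hidden size is read from the backbone config, so the head adapts automatically.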
Author countries
China