Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform


Authors: Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonnan Cheng, Long Ye

Published: 2025-08-14 11:56:30+00:00

AI Summary

This paper introduces the Fake Speech Wild (FSW) dataset, a 254-hour collection of real and deepfake audio gathered from four media platforms with a focus on social media, to address the poor cross-domain performance of existing deepfake audio detection models. By augmenting public datasets, adding the FSW training set, and using self-supervised learning (SSL)-based countermeasures, the authors reach an average equal error rate (EER) of 3.54% across all evaluation sets.

Abstract

The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross-domain scenarios. To advance CMs for real-world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. As CMs, we establish a benchmark using public datasets and advanced self-supervised learning (SSL)-based CMs to evaluate current CMs in real-world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advanced real-world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.

Key findings

Models trained only on public datasets performed poorly on the FSW dataset, which exposes the domain gap between public benchmarks and social media audio. Data augmentation strategies improved robustness. Joint training on augmented public datasets and the FSW training set reduced the average equal error rate (EER) to 3.54% across all evaluation sets.
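For readers unfamiliar with the metric, the following is a minimal sketch of how an EER and its average over several evaluation sets could be computed. It is not the authors' evaluation code. The function names, the toy scores, and the convention that a higher score means "more likely real" are assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for one evaluation set.

    Assumption: a higher score means the clip is more likely real (bona fide).
    The EER is taken at the candidate threshold where the false acceptance
    rate and the false rejection rate are closest to each other.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a real clip scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a fake clip scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def average_eer(eer_by_set):
    """Unweighted mean of per-set EERs (each set counts equally)."""
    return sum(eer_by_set.values()) / len(eer_by_set)


# Toy scores, invented for this example only.
eer_a, thr_a = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
eer_b, thr_b = compute_eer([0.9, 0.8], [0.1, 0.2])
mean_eer = average_eer({"set_a": eer_a, "set_b": eer_b})
```

An unweighted mean treats a small evaluation set the same as a large one. Whether the reported 3.54% is averaged this way is not stated in this summary.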

Approach

The authors created the FSW dataset of real and deepfake audio collected from four media platforms, mainly social media. They then benchmarked countermeasures trained on public datasets, including self-supervised learning (SSL)-based models, against FSW, and assessed how much data augmentation improves robustness. Finally, they improved detection performance by augmenting the public datasets and adding the FSW training set to model training.
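This summary does not say which augmentation strategies the authors used. As a purely illustrative assumption, the sketch below shows one common kind of waveform augmentation: adding Gaussian noise at a chosen signal-to-noise ratio (SNR). The function name, the 16 kHz sample rate, and the 15 dB setting are hypothetical and are not taken from the paper.

```python
import math
import random


def add_noise_at_snr(samples, snr_db, rng):
    """Return a copy of `samples` with white Gaussian noise at `snr_db` dB SNR.

    Illustrative assumption only: this is not the augmentation pipeline
    described in the paper.
    """
    signal_power = sum(x * x for x in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, noise_std) for x in samples]


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=15.0, rng=rng)

# Measure the SNR that was actually achieved.
noise = [n - c for n, c in zip(noisy, clean)]
clean_power = sum(c * c for c in clean) / len(clean)
measured_noise_power = sum(v * v for v in noise) / len(noise)
measured_snr_db = 10 * math.log10(clean_power / measured_noise_power)
```

In a training loop, a transform like this would typically be applied at random to a fraction of the clips so the model sees both clean and degraded versions.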

Datasets

Fake Speech Wild (FSW), ASVspoof2019LA (19LA), Codecfake, CFAD, In the Wild (ITW), VCTK, AISHELL

Model(s)

AASIST, WavLM-AASIST, XLSR-AASIST

Author countries

China
