SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods

Authors: Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian

Published: 2025-07-29 03:06:31+00:00

AI Summary

The paper introduces SpeechFake, a large-scale multilingual speech deepfake dataset containing over 3 million deepfake samples generated with 40 different speech synthesis tools. SpeechFake addresses the limited scale, generation-method diversity, and language coverage of existing datasets, enabling the development of more robust deepfake detection models.

Abstract

As speech generation technology advances, the risk of misuse through deepfake audio has become a pressing concern, which underscores the critical need for robust detection systems. However, many existing speech deepfake datasets are limited in scale and diversity, making it challenging to train models that can generalize well to unseen deepfakes. To address these gaps, we introduce SpeechFake, a large-scale dataset designed specifically for speech deepfake detection. SpeechFake includes over 3 million deepfake samples, totaling more than 3,000 hours of audio, generated using 40 different speech synthesis tools. The dataset encompasses a wide range of generation techniques, including text-to-speech, voice conversion, and neural vocoding, incorporating the latest cutting-edge methods. It also provides multilingual support, spanning 46 languages. In this paper, we offer a detailed overview of the dataset's creation, composition, and statistics. We also present baseline results by training detection models on SpeechFake, demonstrating strong performance on both its own test sets and various unseen test sets. Additionally, we conduct experiments to rigorously explore how generation methods, language diversity, and speaker variation affect detection performance. We believe SpeechFake will be a valuable resource for advancing speech deepfake detection and developing more robust models for evolving generation techniques.


Key findings
Models trained on SpeechFake substantially outperformed those trained on existing datasets, particularly on unseen deepfakes. Analysis showed that the choice of generation method and the language diversity of the training data strongly affected detection performance, whereas speaker variation had little effect. These results highlight the difficulty of generalization in deepfake detection and the value of SpeechFake for advancing the field.
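The summary does not name the evaluation metric, but speech deepfake detection benchmarks such as ASVspoof are conventionally scored by equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. The sketch below, with an illustrative compute_eer helper and made-up scores, shows one common way to derive it from detector outputs:

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 = bonafide, 0 = spoof; scores: higher = more bonafide-like
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2.0     # average them at that threshold

# Toy usage (scores are illustrative only)
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
print(f"EER = {compute_eer(labels, scores):.3f}")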
Approach
The authors built SpeechFake by generating deepfake audio with 40 different speech synthesis tools spanning text-to-speech, voice conversion, and neural vocoding methods. They then trained state-of-the-art deepfake detection models (AASIST and W2V+AASIST) on the dataset to establish performance baselines and to analyze how generation method, language, and speaker variation affect detection; see the sketch below.
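The model names suggest the common pairing of a self-supervised wav2vec 2.0 frontend with the AASIST graph-attention classifier. The following is a minimal sketch, not the paper's implementation: the facebook/wav2vec2-xls-r-300m checkpoint is an assumption, and a mean-pool plus linear head stands in for the actual AASIST backend.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2VDetector(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-xls-r-300m"):  # assumed checkpoint
        super().__init__()
        # Self-supervised speech frontend; the real W2V+AASIST feeds these
        # features into AASIST's spectro-temporal graph attention backend.
        self.frontend = Wav2Vec2Model.from_pretrained(checkpoint)
        # Placeholder head: binary bonafide-vs-spoof classification.
        self.head = nn.Linear(self.frontend.config.hidden_size, 2)

    def forward(self, waveform):
        # waveform: (batch, samples) of raw 16 kHz audio
        feats = self.frontend(waveform).last_hidden_state  # (batch, frames, dim)
        return self.head(feats.mean(dim=1))                # pool over time, classify

model = W2VDetector()
logits = model(torch.randn(2, 16000))  # two 1-second dummy clips
print(logits.shape)                    # torch.Size([2, 2])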
Datasets
SpeechFake (Bilingual and Multilingual datasets), LibriTTS, VCTK, AISHELL1, AISHELL3, CommonVoice, ASVspoof2019-LA, FakeOrReal, WaveFake, In-the-Wild, CD-ADD, ASVspoof5
Model(s)
AASIST, W2V+AASIST
Author countries
China