SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

Published: 2024-09-18 23:17:02+00:00

AI Summary

The paper introduces SpoofCeleb, a new dataset for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV). The authors train 23 TTS systems on real-world, noisy speech from VoxCeleb1 and use them to generate spoofed speech, yielding a large, speaker-diverse dataset of bona fide and spoofed utterances. This addresses two limitations of existing datasets: their reliance on clean, studio-quality recordings and their insufficient speaker diversity for SASV training.

Abstract

This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, current datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present the baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.

Key findings

Baseline SDD and SASV models trained on SpoofCeleb significantly outperformed those trained on other datasets, demonstrating the dataset's effectiveness. The results highlight the challenges posed by real-world, noisy speech in both deepfake generation and detection. The large scale and diversity of SpoofCeleb offer a valuable resource for advancing research in SDD and SASV.

Approach

The authors created SpoofCeleb by developing a fully automated pipeline that processes VoxCeleb1 into a form suitable for TTS training. They then trained 23 contemporary TTS systems on the processed data and used them to generate spoofed speech. The spoofed utterances are combined with the bona fide VoxCeleb1 recordings and partitioned into training, validation, and evaluation sets with controlled experimental protocols.

Datasets

VoxCeleb1, ASVspoof2019, LibriSpeech, GigaSpeech, Multilingual LibriSpeech

Model(s)

TransformerTTS, GradTTS, Matcha-TTS, BVAE-TTS, DiffWave, HiFiGAN, Parallel WaveGAN, NSF-HiFiGAN, BigVGAN, WaveGlow, VALL-E, Multi-Scale Transformer, Delay, MQTTS, VITS, RawNet2, AASIST, SKA-TDNN

Author countries

USA, China, Japan, South Korea, France
