ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

Published: 2024-08-16 13:37:20+00:00

AI Summary

The ASVspoof 5 challenge (2024 edition) focuses on evaluating speech spoofing and deepfake detection systems. It uses a significantly larger, crowdsourced dataset with diverse acoustic conditions and incorporates adversarial attacks for the first time, pushing the limits of current detection technologies.

Abstract

ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements.


Key findings
The baseline systems performed poorly against the advanced attacks and the diverse acoustic conditions. Many participant submissions significantly outperformed the baselines, especially those that used external data and pre-trained self-supervised models. Score calibration was identified as a significant area for future improvement.
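As a hedged illustration of the calibration issue, the sketch below shows one common approach, affine score calibration fitted by logistic regression on held-out development data; it is an assumed, generic recipe, not the challenge's official procedure or any team's actual submission.

```python
# Minimal sketch of affine score calibration (an assumption, not the official recipe):
# map raw countermeasure scores to approximate log-likelihood ratios.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibration(dev_scores, dev_labels):
    """Fit scale and offset (a, b) so that a*score + b behaves like an LLR.
    dev_labels: 1 for bona fide, 0 for spoofed trials."""
    lr = LogisticRegression()
    lr.fit(dev_scores.reshape(-1, 1), dev_labels)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    # Subtract the development-set prior log-odds so the output does not depend
    # on the class proportions of the calibration data.
    prior_logit = np.log(dev_labels.mean() / (1.0 - dev_labels.mean()))
    return a, b - prior_logit

def calibrate(scores, a, b):
    return a * scores + b

# Usage (hypothetical score arrays):
# a, b = fit_calibration(dev_scores, dev_labels)
# eval_llr = calibrate(eval_scores, a, b)
```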
Approach
The challenge evaluates both stand-alone spoofing/deepfake detection (Track 1) and spoofing-robust automatic speaker verification (SASV, Track 2). Participants submitted detection scores, which were evaluated using metrics such as the minimum detection cost function (minDCF) and the equal error rate (EER). Both tracks offered closed and open conditions, with external data and pre-trained models permitted only in the open condition.
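To make the metrics concrete, the sketch below computes an EER and a normalised minimum detection cost from bona fide and spoof score arrays. It is a simplified illustration: the cost parameters, priors, and normalisation are placeholder assumptions, not the official ASVspoof 5 scoring configuration or toolkit.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER: operating point where the miss rate (bona fide rejected) equals the
    false-alarm rate (spoof accepted). Higher score = more bona fide-like."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)), np.zeros(len(spoof_scores))])
    labels = labels[np.argsort(scores)]          # sort labels by ascending score
    n_bona, n_spoof = len(bonafide_scores), len(spoof_scores)
    miss = np.cumsum(labels) / n_bona            # bona fide below each threshold
    fa = (n_spoof - np.cumsum(1 - labels)) / n_spoof  # spoof at/above each threshold
    idx = np.argmin(np.abs(miss - fa))           # crossing point
    return (miss[idx] + fa[idx]) / 2

def compute_min_dcf(bonafide_scores, spoof_scores,
                    p_bona=0.95, c_miss=1.0, c_fa=10.0):  # placeholder costs/prior
    """Minimum detection cost over all thresholds, normalised by the cheaper
    'accept-all / reject-all' default (a common convention; the official
    normalisation may differ)."""
    best = np.inf
    for t in np.unique(np.concatenate([bonafide_scores, spoof_scores])):
        p_miss = np.mean(bonafide_scores < t)
        p_fa = np.mean(spoof_scores >= t)
        best = min(best, p_bona * c_miss * p_miss + (1 - p_bona) * c_fa * p_fa)
    return best / min(p_bona * c_miss, (1 - p_bona) * c_fa)

# Toy usage with synthetic scores: bona fide trials score higher on average.
rng = np.random.default_rng(0)
bona, spoof = rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)
print(f"EER = {compute_eer(bona, spoof):.3f}, minDCF = {compute_min_dcf(bona, spoof):.3f}")
```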
Datasets
Multilingual LibriSpeech (MLS), English partition; VoxCeleb 1 and 2 (for the common ASV system); copy-synthesis data (for the Track 2 baseline)
Model(s)
RawNet2, AASIST, ECAPA-TDNN (in the common ASV system), MFA-Conformer (in the Track 2 baseline), and various models submitted by participants (including those using pre-trained self-supervised models such as wav2vec 2.0)
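The sketch below shows one generic way a pre-trained self-supervised front-end such as wav2vec 2.0 can be paired with a lightweight classifier head for bona fide vs. spoof detection. It is an assumed design for illustration only, not the architecture of any specific submission or baseline; the model name and pooling choice are placeholders.

```python
# Hedged sketch: wav2vec 2.0 front-end + mean pooling + linear head (assumed design).
import torch
from transformers import Wav2Vec2Model

class SSLCountermeasure(torch.nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-base"):
        super().__init__()
        self.frontend = Wav2Vec2Model.from_pretrained(ssl_name)   # may be frozen or fine-tuned
        self.classifier = torch.nn.Linear(self.frontend.config.hidden_size, 2)

    def forward(self, waveform):                                   # waveform: (batch, samples), 16 kHz
        feats = self.frontend(waveform).last_hidden_state          # (batch, frames, hidden)
        pooled = feats.mean(dim=1)                                 # mean pooling over time
        return self.classifier(pooled)                             # logits: [spoof, bona fide]
```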
Author countries
China, Spain, India, South Korea, Singapore