ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado

Published: 2021-09-01 16:17:31+00:00

AI Summary

The ASVspoof 2021 challenge focused on advancing spoofed and deepfake speech detection. It introduced a new deepfake speech detection task alongside logical and physical access tasks, evaluating progress without providing matched training data, reflecting real-world scenarios.

Abstract

ASVspoof 2021 is the fourth edition in the series of biennial challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection. This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results. Despite the introduction of channel and compression variability which compound the difficulty, results for the logical access and deepfake tasks are close to those from previous ASVspoof editions. Results for the physical access task show the difficulty in detecting attacks in real, variable physical spaces. With ASVspoof 2021 being the first edition for which participants were not provided with any matched training or development data and with this reflecting real conditions in which the nature of spoofed and deepfake speech can never be predicted with confidence, the results are extremely encouraging and demonstrate the substantial progress made in the field in recent years.


Key findings

Results showed encouraging progress in the logical access and deepfake tasks, with the best system achieving a min t-DCF of 0.2177 and an EER of 1.32% for logical access. The physical access task proved more challenging, while the deepfake task showed evidence of overfitting to the progress partition. Statistical significance tests revealed notable performance differences between top-ranked systems.

Approach

The challenge involved developing countermeasures for three tasks: logical access, physical access, and deepfake speech detection. Submissions were evaluated using the tandem detection cost function (t-DCF) for the logical and physical access tasks, and the equal error rate (EER) for the deepfake task. No matched training or development data was provided.
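To make the EER metric concrete, the sketch below computes it from detector scores by sweeping candidate thresholds until the miss rate (bona fide rejected) and false-alarm rate (spoof accepted) are as close as possible. This is an illustrative implementation on synthetic scores, not the official ASVspoof scoring tool; the score distributions and function name are assumptions for the example.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: operating point where the miss rate equals the
    false-alarm rate. Simple threshold sweep for illustration only."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    eer, best_gap = 0.5, np.inf
    for t in thresholds:
        miss = np.mean(bonafide_scores < t)   # bona fide trials rejected
        fa = np.mean(spoof_scores >= t)       # spoofed trials accepted
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer

# Synthetic scores: higher means "more bona fide" (hypothetical data)
rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 1000)
spoof = rng.normal(-2.0, 1.0, 1000)
print(f"EER: {compute_eer(bona, spoof):.3f}")
```

For well-separated score distributions like these, the EER is small; the best ASVspoof 2021 logical access system reached 1.32% on real evaluation data.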
Datasets

Training and development data for the logical and physical access tasks came from the ASVspoof 2019 databases (built from the VCTK corpus); the deepfake task drew on the ASVspoof 2019 LA evaluation set and other undisclosed corpora. New evaluation partitions were created for all three tasks.

Model(s)

GMM-based systems using CQCCs or LFCCs, an LFCC-LCNN system, and the RawNet2 architecture served as baselines. Participants were free to use any other models.

Author countries

Japan, China, Italy, Bangladesh, Spain, Germany, Unknown