ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, Zhen-Hua Ling

Published: 2019-11-05 03:51:37+00:00

AI Summary

This paper introduces ASVspoof 2019, a large-scale public database for synthesized, converted, and replayed speech aimed at advancing research in automatic speaker verification (ASV) spoofing countermeasures. The database includes diverse spoofing attacks generated using state-of-the-art techniques and is designed to reflect logical and physical access scenarios.

Abstract

Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as presentation attacks. These vulnerabilities are generally unacceptable and call for spoofing countermeasures or presentation attack detection systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than was previously possible. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment of spoofed data in the logical access scenario. The assessment demonstrated that the spoofing data in the ASVspoof 2019 database have varying degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona fide utterances even by human subjects.


Key findings
Various spoofing attacks significantly degraded ASV performance, with waveform concatenation-based attacks being particularly effective. Baseline countermeasures showed varying effectiveness against different attacks; CQCC-based countermeasures generally outperformed LFCC-based ones. Human evaluation revealed that some synthetic speech was perceptually indistinguishable from genuine speech.
Approach
The ASVspoof 2019 database was created to evaluate ASV systems and their countermeasures against various spoofing attacks. Evaluation uses the tandem detection cost function (t-DCF) metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. The database covers two scenarios: logical access (synthesized and converted speech) and physical access (replayed speech), the latter generated through carefully controlled simulations.
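To make the idea of a detection cost concrete, here is a minimal sketch of a normalized minimum cost computed from countermeasure scores. It is a simplified stand-in, not the challenge's official scoring code: it assumes the cost is a weighted sum of the countermeasure's miss and false-alarm rates, and the weights `c1` and `c2` are hypothetical placeholders. In the actual t-DCF, those weights depend on the error rates of the fixed ASV system and on assumed priors and costs. The function names are also illustrative.

```python
def cm_error_rates(bona_scores, spoof_scores, threshold):
    """Return (miss rate, false-alarm rate) of a countermeasure at one threshold.

    Assumes higher scores indicate bona fide speech. A bona fide trial scoring
    below the threshold is a miss; a spoofed trial scoring at or above it is a
    false alarm.
    """
    p_miss = sum(s < threshold for s in bona_scores) / len(bona_scores)
    p_fa = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    return p_miss, p_fa


def min_normalized_cost(bona_scores, spoof_scores, c1, c2):
    """Minimum over thresholds of (c1 * P_miss + c2 * P_fa) / min(c1, c2).

    c1 and c2 are placeholder positive weights (an assumption of this sketch).
    Dividing by min(c1, c2) means a system that accepts or rejects everything
    scores at most 1, so values below 1 indicate a useful countermeasure.
    """
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects every trial
    default_cost = min(c1, c2)
    best = float("inf")
    for t in thresholds:
        p_miss, p_fa = cm_error_rates(bona_scores, spoof_scores, t)
        best = min(best, (c1 * p_miss + c2 * p_fa) / default_cost)
    return best
```

With perfectly separated scores the cost reaches 0; overlapping score distributions push it toward 1.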
Datasets
VCTK corpus (downsampled to 16 kHz); VoxCeleb1 and VoxCeleb2 databases.
Model(s)
DNN-based x-vector speaker embeddings with a PLDA backend for ASV; GMM with CQCC or LFCC features for countermeasures.
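A GMM-based countermeasure of this kind is commonly scored with a log-likelihood ratio between a bona fide model and a spoof model. The sketch below illustrates that scoring step only, under stated assumptions: diagonal covariances, already-trained model parameters, and generic feature frames standing in for CQCC or LFCC vectors. The function names and the `(weights, means, variances)` tuple layout are hypothetical, and the summary above does not specify these details.

```python
import math


def diag_gmm_loglik(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    comp_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        comp_logs.append(log_p)
    # Log-sum-exp over mixture components for numerical stability.
    peak = max(comp_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in comp_logs))


def llr_score(frames, bona_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; positive favors bona fide.

    Each model is a (weights, means, variances) tuple, a layout chosen for
    this sketch only.
    """
    total = 0.0
    for frame in frames:
        total += diag_gmm_loglik(frame, *bona_gmm) - diag_gmm_loglik(frame, *spoof_gmm)
    return total / len(frames)


# Toy single-component models over 2-D placeholder features (not real CQCC/LFCC).
bona_gmm = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
spoof_gmm = ([1.0], [[3.0, 3.0]], [[1.0, 1.0]])
```

Calling `llr_score(frames, bona_gmm, spoof_gmm)` on frames near the bona fide mean gives a positive score, and frames near the spoof mean give a negative one.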
Author countries
Japan, UK, France, Germany, Finland, Taiwan, Ireland, USA, China