ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech


Authors: Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee

Published: 2021-02-11 08:41:42+00:00

AI Summary

This paper analyzes the results of the ASVspoof 2019 challenge, focusing on the top-performing systems for detecting synthesized, converted, and replayed speech. The findings highlight the effectiveness of fusion techniques for logical access scenarios and the significant gap between simulated and real replay data performance.

Abstract

The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV). This paper describes the third in a series of biennial challenges: ASVspoof 2019. With the challenge database and protocols being described elsewhere, the focus of this paper is on results and the top performing single and ensemble system submissions from 62 teams, all of which out-perform the two baseline systems, often by a substantial margin. Deeper analyses show that performance is dominated by specific conditions involving either specific spoofing attacks or specific acoustic environments. While fusion is shown to be particularly effective for the logical access scenario involving speech synthesis and voice conversion attacks, participants largely struggled to apply fusion successfully for the physical access scenario involving simulated replay attacks. This is likely the result of a lack of system complementarity, while oracle fusion experiments show clear potential to improve performance. Furthermore, while results for simulated data are promising, experiments with real replay data show a substantial gap, most likely due to the presence of additive noise in the latter. This finding, among others, leads to a number of ideas for further research and directions for future editions of the ASVspoof challenge.

Key findings

Fusion of multiple countermeasure (CM) systems was particularly effective in the logical access scenario. In the physical access scenario, a substantial performance gap was observed between simulated and real replay data, likely due to additive noise in the real data. The best systems generally used spectral features and deep neural network classifiers.
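To make the idea of fusion concrete, the following is a minimal sketch of score-level fusion, assuming each system outputs one detection score per trial. The z-normalization step, the weighted sum, and the names `z_normalize` and `fuse_scores` are illustrative assumptions; this is not the fusion method of any particular challenge submission.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Shift and scale one system's scores to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # A constant-output system carries no information; map it to zeros.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(system_scores, weights=None):
    """Weighted sum of per-system normalized scores (hypothetical helper).

    system_scores: list of score lists, one list per system, all aligned
                   on the same trials.
    weights:       one weight per system; defaults to equal weights.
    """
    n_systems = len(system_scores)
    if weights is None:
        weights = [1.0 / n_systems] * n_systems
    normalized = [z_normalize(s) for s in system_scores]
    n_trials = len(normalized[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized))
        for i in range(n_trials)
    ]


if __name__ == "__main__":
    # Two toy systems with different score ranges, scored on the same trials.
    system_a = [0.0, 2.0]
    system_b = [10.0, 30.0]
    print(fuse_scores([system_a, system_b]))
```

Normalizing before summing keeps a system with a large score range from dominating the fused score. Fusion can only help when the systems make different errors, which matches the complementarity argument in the abstract.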

Approach

The challenge evaluated various spoofing countermeasure (CM) systems submitted by different teams. These systems used diverse features (e.g., LFCCs, Mel-spectrograms, CQCCs) and architectures (e.g., CNNs, ResNets, GMM-UBMs) to detect spoofed speech. Performance was measured using the tandem detection cost function (t-DCF).
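As a rough illustration of the metric, the sketch below computes a minimum normalized t-DCF from CM scores, assuming the general form t-DCF(s) = C1 · P_miss(s) + C2 · P_fa(s), normalized by min(C1, C2). The constants `c1` and `c2` are treated as given inputs here; in the challenge they depend on the ASV system's error rates, the cost parameters, and the class priors, which are not reproduced in this summary. The function names and the "higher score means more bona fide" convention are assumptions.

```python
def cm_error_rates(bona_scores, spoof_scores, threshold):
    """Miss and false-alarm rates of a CM at a single threshold.

    Assumed convention: a trial is accepted as bona fide when
    its score is >= threshold.
    """
    # Bona fide trials rejected by the countermeasure.
    p_miss = sum(s < threshold for s in bona_scores) / len(bona_scores)
    # Spoofed trials accepted by the countermeasure.
    p_fa = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    return p_miss, p_fa


def min_normalized_tdcf(bona_scores, spoof_scores, c1, c2):
    """Minimum over thresholds of (c1 * P_miss + c2 * P_fa) / min(c1, c2)."""
    # Candidate thresholds: every observed score, plus one above the
    # maximum so that the "reject everything" operating point is included.
    candidates = sorted(set(bona_scores) | set(spoof_scores))
    candidates.append(candidates[-1] + 1.0)
    best = float("inf")
    for threshold in candidates:
        p_miss, p_fa = cm_error_rates(bona_scores, spoof_scores, threshold)
        cost = (c1 * p_miss + c2 * p_fa) / min(c1, c2)
        best = min(best, cost)
    return best


if __name__ == "__main__":
    # Toy scores and placeholder constants, for illustration only.
    bona = [2.0, 3.0, 1.5]
    spoof = [-1.0, 0.0, 1.8]
    print(min_normalized_tdcf(bona, spoof, c1=1.0, c2=1.0))
```

Under this normalization, a countermeasure that separates the two classes perfectly scores 0, and one whose scores carry no information scores 1, so lower values are better.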

Datasets

The challenge used the ASVspoof 2019 database, which is sourced from the Voice Cloning Toolkit (VCTK) corpus. It covers logical access (LA) scenarios with speech synthesis and voice conversion attacks, and physical access (PA) scenarios with simulated and real replay attacks.

Model(s)

Various models were used, including CNNs, ResNets, GMM-UBMs, SVMs, and combinations thereof. Specific architectures varied widely depending on the team's submission.

Author countries

France, Japan, Finland, Spain, Singapore
