Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge

Authors: Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, Ha-Jin Yu

Published: 2019-04-23 03:29:36+00:00

AI Summary

This paper proposes an end-to-end deep neural network (DNN) for replay attack detection in speaker verification, using high-resolution spectrograms that carry complementary information (magnitude, phase, and power spectral density). The approach avoids handcrafted features: spectrograms are fed directly into the network without knowledge-based intervention, so that it can pick up cues that conventional features miss in unknown replay configurations.

Abstract

In this study, we concentrate on replacing the process of extracting hand-crafted acoustic features with an end-to-end DNN using complementary high-resolution spectrograms. As a result of advances in audio devices, the typical characteristics of replayed speech based on conventional knowledge alter or diminish in unknown replay configurations. Thus, it has become increasingly difficult to detect spoofed speech with a conventional knowledge-based approach. To detect unrevealed characteristics that reside in replayed speech, we directly input spectrograms into an end-to-end DNN without knowledge-based intervention. This study differs from existing spectrogram-based systems in two respects: complementary information and high resolution. Spectrograms with different information are explored, and it is shown that additional information such as phase can be complementary. High-resolution spectrograms are employed under the assumption that the difference between bona fide and replayed speech lies in the details. Additionally, to verify whether other features are complementary to spectrograms, we also examine a raw waveform system and an i-vector based system. Experiments conducted on the ASVspoof 2019 physical access challenge show promising results: the t-DCF and equal error rate on the evaluation set are 0.0570 and 2.45%, respectively.


Key findings

The proposed system significantly outperforms a baseline CQCC-GMM system on the ASVspoof 2019 physical access challenge evaluation set, achieving a t-DCF of 0.0570 and an EER of 2.45%. High-resolution spectrograms and the inclusion of complementary information (phase and PSD) are shown to be crucial for improved performance.
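
For readers unfamiliar with the equal error rate (EER) quoted above, the following is a minimal sketch of how it can be estimated from detection scores. It is not the official ASVspoof scoring tool: the function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the simple threshold sweep are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate (eer, threshold); higher scores are assumed to mean 'bona fide'."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is tried as a decision threshold.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # The EER is read off where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]
```

With perfectly separated scores the estimate is 0; when bona fide and spoofed scores follow the same distribution it approaches 0.5. The t-DCF additionally depends on the errors and costs of the speaker verification system and is not covered by this sketch.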

Approach

The authors use an end-to-end DNN, composed of convolutional neural networks (CNNs), gated recurrent units (GRUs), and fully connected layers, to directly process high-resolution spectrograms (including magnitude, phase, and power spectral density) without handcrafted feature extraction. Model and score-level ensembles are explored to combine information from different spectrograms.
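
The three input representations named above (magnitude, phase, and power spectral density) can all be derived from one short-time Fourier transform. The sketch below illustrates this with NumPy only. The function name `complementary_spectrograms`, the FFT size of 2048, the hop of 128 samples, the Hann window, and the window-energy scaling of the power estimate are assumptions chosen for illustration, not the settings reported by the authors.

```python
import numpy as np

def complementary_spectrograms(waveform, n_fft=2048, hop=128):
    """Return magnitude, phase, and power spectrograms, each (frames, n_fft // 2 + 1)."""
    x = np.asarray(waveform, dtype=float)
    if x.size < n_fft:
        # Zero-pad very short signals so that at least one frame exists.
        x = np.pad(x, (0, n_fft - x.size))
    n_frames = 1 + (x.size - n_fft) // hop
    window = np.hanning(n_fft)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)      # complex STFT, one row per frame
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)                  # radians in [-pi, pi]
    # Periodogram-style power estimate, scaled by window energy (an assumed convention).
    power = magnitude ** 2 / np.sum(window ** 2)
    return magnitude, phase, power
```

A larger `n_fft` gives finer frequency resolution and a smaller `hop` gives finer time resolution, which is one plausible reading of "high-resolution" here. Each returned array could serve as the input to a separate network or be stacked as channels.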

Datasets

ASVspoof 2019 physical access dataset

Model(s)

CNN-GRU architecture with residual blocks. Model-level and score-level ensembles of multiple DNNs are also used.
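
A score-level ensemble combines the per-trial outputs of several independently trained systems after inference. The sketch below shows one simple way to do this, a weighted average of scores. The function name `score_level_ensemble` and the choice of a (normalized) weighted mean are assumptions for illustration; the summary does not state which combination rule the authors used.

```python
import numpy as np

def score_level_ensemble(system_scores, weights=None):
    """Fuse scores of shape (n_systems, n_trials) into one score per trial."""
    scores = np.asarray(system_scores, dtype=float)
    if weights is None:
        # Default to an equal-weight average over systems.
        weights = np.ones(scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # normalize so the weights sum to 1
    return weights @ scores                     # weighted mean over the system axis
```

For example, `score_level_ensemble([mag_scores, phase_scores, psd_scores])` would average three hypothetical score vectors, one from each spectrogram-specific model. A model-level ensemble, by contrast, merges the networks' internal representations before the final decision and needs joint training, so it is not shown here.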

Author countries

South Korea