The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Authors: Weicheng Cai, Haiwei Wu, Danwei Cai, Ming Li

Published: 2019-07-05 03:00:05+00:00

AI Summary

This paper presents a deep learning-based system for replay attack detection in the ASVspoof 2019 challenge. The system leverages data augmentation (speed perturbation), explores various feature representations (including group delay gram), and employs a residual neural network for classification, achieving a low equal error rate (EER) of 0.66% on the evaluation set through system fusion.

Abstract

This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop a spoofing countermeasure for automatic speaker verification in the physical access scenario. We improve the countermeasure pipeline along four aspects: data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It receives a variable-length feature sequence and outputs utterance-level scores directly. Within this framework, we evaluate various input feature representations extracted from either the magnitude spectrum or the phase spectrum. We also perform data augmentation by applying speed perturbation to the raw waveform. Our best single system employs a residual neural network trained on the speed-perturbed group delay gram. It achieves an EER of 1.04% on the development set and 1.08% on the evaluation set. Finally, taking the simple average score over several single systems further improves performance: our primary system obtains an EER of 0.24% on the development set and 0.66% on the evaluation set.
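The speed perturbation mentioned in the abstract amounts to resampling the raw waveform so it plays faster or slower. As a minimal illustrative sketch (the paper does not specify its resampling method; linear interpolation and the `speed_perturb` helper name are assumptions here), the idea can be shown with NumPy:

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform by linear interpolation to change its speed.

    factor > 1 speeds playback up (shorter output); factor < 1 slows it
    down (longer output). Toolkits typically use factors like 0.9 and 1.1.
    """
    n_out = int(round(len(wave) / factor))
    # Positions in the original signal sampled for each output point.
    idx = np.arange(n_out) * factor
    return np.interp(idx, np.arange(len(wave)), wave)
```

In practice, production pipelines usually delegate this to a resampler such as sox or torchaudio rather than raw interpolation, but the effect on the signal is the same: the augmented copies differ in duration and pitch, which diversifies the training data.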


Key findings
The proposed system achieved a significant performance improvement over the baseline GMM system, reaching an EER of 0.66% on the evaluation set after system fusion. The group delay gram feature proved particularly effective, and data augmentation via speed perturbation further enhanced performance. Results on development and evaluation sets were consistent, indicating robustness.
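The fusion step reported above is, per the abstract, a simple average of the per-utterance scores produced by the single systems. A minimal sketch of that equal-weight fusion (the `fuse_scores` helper name is hypothetical):

```python
def fuse_scores(system_scores):
    """Equal-weight score-level fusion.

    system_scores: a list of score lists, one per single system, where
    each inner list holds one countermeasure score per trial utterance.
    Returns the per-utterance average across systems.
    """
    n_systems = len(system_scores)
    return [sum(trial) / n_systems for trial in zip(*system_scores)]
```

Averaging raw scores assumes the single systems produce scores on comparable scales; otherwise a normalization step would precede the average.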
Approach
The authors developed an utterance-level deep learning framework using a residual neural network (ResNet). They explored various audio features extracted from both the magnitude and phase spectra, applied speed perturbation for data augmentation, and finally fused multiple systems for improved performance.
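The group delay gram, the phase-derived feature behind the best single system, stacks the per-frame group delay function into a time-frequency matrix. The standard computation uses tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, where X is the FFT of the frame x[n] and Y is the FFT of n*x[n]. A minimal NumPy sketch follows; the frame length, hop size, and Hamming window are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def group_delay_gram(signal: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Compute a group delay gram: one group delay function per frame.

    Uses tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2 with Y = FFT(n * x[n]).
    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hamming(frame_len)
    ramp = np.arange(frame_len)  # the time index n multiplying x[n]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        x = signal[start:start + frame_len] * window
        X = np.fft.rfft(x)
        Y = np.fft.rfft(ramp * x)
        denom = np.abs(X) ** 2 + 1e-10  # guard against division by zero
        frames.append((X.real * Y.real + X.imag * Y.imag) / denom)
    return np.array(frames)
```

Replay artifacts introduced by loudspeakers and recording devices distort phase as well as magnitude, which is the usual motivation for phase-based features like this in spoofing countermeasures.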
Datasets
ASVspoof 2019 challenge dataset
Model(s)
Residual Neural Network (ResNet): a ResNet-34-like architecture with fewer parameters.
Author countries
China