Multi-task Learning Based Spoofing-Robust Automatic Speaker Verification System

Authors: Yuanjun Zhao, Roberto Togneri, Victor Sreeram

Published: 2020-12-06 01:03:35+00:00

AI Summary

This paper proposes a spoofing-robust automatic speaker verification (SR-ASV) system using a multi-task learning architecture. This deep learning model jointly trains speaker verification and spoofing detection, achieving substantial performance improvements over state-of-the-art systems on the ASVspoof 2017 and 2019 corpora.

Abstract

Spoofing attacks posed by generating artificial speech can severely degrade the performance of a speaker verification system. Recently, many anti-spoofing countermeasures have been proposed for detecting varying types of attacks, from synthetic speech to replay presentations. While numerous effective defenses have been reported for standalone anti-spoofing solutions, integrating speaker verification and spoofing detection systems has obvious benefits. In this paper, we propose a spoofing-robust automatic speaker verification (SR-ASV) system for diverse attacks based on a multi-task learning architecture. This deep learning based model is jointly trained with time-frequency representations from utterances to provide recognition decisions for both tasks simultaneously. Compared with other state-of-the-art systems on the ASVspoof 2017 and 2019 corpora, a substantial improvement of the combined system under different spoofing conditions can be obtained.


Key findings
The proposed SR-ASV system outperforms state-of-the-art integrated systems on both the ASVspoof 2017 and 2019 corpora under various spoofing conditions. It achieves lower equal error rates (EERs) and lower tandem detection cost function (t-DCF) values than the baseline and benchmark systems. Fusing the results obtained from different feature sets improves performance further.
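For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch in plain Python. The function name, the simple threshold sweep, and the toy scores are illustrative assumptions; this is not the official ASVspoof scoring tool or the authors' code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Hypothetical helper for illustration only.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor/spoof trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one genuine trial scores below one spoofed trial.
genuine = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, spoofed))  # 0.25
```

With real score files, an interpolated ROC-based estimate is usually preferred over this coarse sweep, since the two error rates rarely coincide exactly at an observed score.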
Approach
The authors use a multi-task learning architecture: a single deep learning model is trained jointly on time-frequency representations of utterances to perform both speaker verification and spoofing detection. Sequential residual convolutional blocks with Max-Feature-Map activations are used to improve the model's generalization.
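The Max-Feature-Map activation mentioned above is commonly defined as splitting the channels of a feature map into two halves and keeping the element-wise maximum, which halves the channel count. A minimal sketch in plain Python, assuming each channel is a flat list of values; the function name and data layout are illustrative and do not reflect the authors' implementation:

```python
def max_feature_map(channels):
    """Element-wise maximum over the two halves of the channel axis.

    `channels` is a list of equally sized lists, one per channel.
    Channel i is paired with channel i + C/2, so C inputs give C/2 outputs.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


# Four toy channels of two values each collapse to two channels.
features = [[1, 5], [2, 0], [3, 1], [0, 4]]
print(max_feature_map(features))  # [[3, 5], [2, 4]]
```

Unlike ReLU, which thresholds each unit on its own, this operation makes paired channels compete, so it acts as a form of feature selection.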
Datasets
ASVspoof 2017 Version 2.0, ASVspoof 2019, VoxCeleb
Model(s)
Multi-task learning architecture based on deep neural networks (DNNs) with sequential residual convolutional blocks and Max-Feature-Map (MFM) activations. An A-softmax (angular softmax) loss function is used for training, and a probabilistic linear discriminant analysis (PLDA) back-end is used for speaker verification.
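As background on the A-softmax loss, the standard formulation replaces the target-class cosine cos(theta) with a margin function psi(theta) = (-1)^k cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m], and normalizes the class weight vectors. The sketch below computes that loss for a single embedding in plain Python. The margin m = 4, the function names, and the toy vectors are assumptions; the summary does not state the paper's settings, and this is not the authors' code.

```python
import math


def angular_margin(theta, m=4):
    """psi(theta): monotonically decreasing replacement for cos(theta)."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


def a_softmax_loss(embedding, class_weights, target, m=4):
    """A-softmax loss for one non-zero embedding (illustrative helper)."""
    norm_x = math.sqrt(sum(v * v for v in embedding))
    logits = []
    for j, w in enumerate(class_weights):
        norm_w = math.sqrt(sum(v * v for v in w))
        cos_t = sum(a * b for a, b in zip(embedding, w)) / (norm_x * norm_w)
        cos_t = max(-1.0, min(1.0, cos_t))  # guard acos against rounding
        if j == target:
            # The margin is applied only to the true class.
            logits.append(norm_x * angular_margin(math.acos(cos_t), m))
        else:
            logits.append(norm_x * cos_t)
    # Numerically stable log-sum-exp, then negative log-likelihood.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


weights = [[1.0, 0.0], [0.0, 1.0]]  # toy two-class weight vectors
with_margin = a_softmax_loss([1.0, 1.0], weights, target=0, m=4)
no_margin = a_softmax_loss([1.0, 1.0], weights, target=0, m=1)
print(with_margin > no_margin)  # True
```

Setting m = 1 recovers an ordinary softmax with normalized weights, so the comparison shows that the margin penalizes embeddings that are not angularly close to their class.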
Author countries
Australia