Multi-Task Siamese Neural Network for Improving Replay Attack Detection

Authors: Patrick von Platen, Fei Tao, Gokhan Tur

Published: 2020-02-16 00:21:16+00:00

AI Summary

This paper proposes using a multi-task Siamese Neural Network (SNN) for improved replay attack detection in speaker verification systems. The SNN outperforms a ResNet baseline by a relative 26.8% in Equal Error Rate (EER), and adding a reconstruction loss yields a further relative improvement of 13.8% in EER.

Abstract

Automatic speaker verification systems are vulnerable to audio replay attacks, which bypass security by replaying recordings of authorized speakers. Replay attack (RA) detection systems built upon Residual Neural Networks (ResNets) have yielded astonishing results on the public benchmark ASVspoof 2019 Physical Access challenge. With most teams using fine-tuned feature extraction pipelines and model architectures, however, the generalizability of such systems remains questionable. In this work, we analyse the effect that discriminative feature learning in a multi-task learning (MTL) setting can have on the generalizability and discriminability of RA detection systems. We use a popular ResNet architecture optimized by the cross-entropy criterion as our baseline and compare it to the same architecture optimized by MTL using Siamese Neural Networks (SNN). It can be shown that the SNN outperforms the baseline by a relative 26.8% in Equal Error Rate (EER). We further enhance the model's architecture and demonstrate that the SNN with an additional reconstruction loss yields another significant relative improvement of 13.8% in EER.


Key findings
The proposed SNN improves replay attack detection performance, reducing the EER by a relative 26.8% compared to a ResNet baseline. Adding a reconstruction loss further enhances the model, achieving a best single-system EER of 1.94%. The use of second-order statistics also contributes to performance gains.
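To make the metric behind these figures concrete, the following is a minimal sketch of how an EER and a relative EER reduction can be computed. It is not the official ASVspoof scoring tool: the function names, the simple threshold sweep, and all scores and EER values used below are illustrative assumptions, not numbers from the paper.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: higher scores mean "more likely genuine".
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: a replayed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is the error rate where both error types are (nearly) equal.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, improved_eer):
    """Relative improvement of improved_eer over baseline_eer (0.268 means 26.8%)."""
    return (baseline_eer - improved_eer) / baseline_eer


# Toy scores, purely illustrative.
genuine = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
toy_eer = equal_error_rate(genuine, spoof)

# Hypothetical EER values chosen only to show what a relative 26.8% means.
toy_gain = relative_reduction(10.0, 7.32)
```

A relative reduction is measured against the baseline's own EER, so a relative 26.8% does not mean the EER dropped by 26.8 percentage points.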
Approach
The authors address the problem of replay attack detection by employing a multi-task learning approach with Siamese Neural Networks (SNNs). The SNN uses a pair of audio inputs during training, minimizing both cross-entropy loss and a distance loss between the embeddings of the input pairs. Further improvements are obtained by adding reconstruction loss to enhance feature learning.
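The pairwise training objective described above can be sketched as follows. This is a framework-free illustration under stated assumptions, not the authors' implementation: the summary does not specify the exact distance loss, so a standard contrastive loss with a margin is swapped in, and `distance_weight` and `margin` are hypothetical hyperparameters. The reconstruction loss is omitted.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given raw class logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_distance_loss(emb_a, emb_b, same_class, margin=1.0):
    """Assumed distance loss: pull same-class pairs together and push
    different-class pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def multitask_siamese_loss(logits_a, logits_b, emb_a, emb_b,
                           label_a, label_b,
                           distance_weight=0.5, margin=1.0):
    """Classification loss on both inputs plus a weighted embedding distance loss."""
    ce = (softmax_cross_entropy(logits_a, label_a)
          + softmax_cross_entropy(logits_b, label_b))
    dist = contrastive_distance_loss(emb_a, emb_b, label_a == label_b, margin)
    return ce + distance_weight * dist


# Toy pair: both utterances genuine (label 0) with identical embeddings,
# so only the cross-entropy terms contribute.
loss = multitask_siamese_loss([0.0, 0.0], [0.0, 0.0],
                              [1.0, 2.0], [1.0, 2.0], 0, 0)
```

In a real system both inputs would pass through the same shared-weight network, which produces the logits and embeddings used here.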
Datasets
ASVspoof 2019 Physical Access (PA) dataset
Model(s)
34-layer ResNet architecture, Siamese Neural Networks (SNNs)
Author countries
Germany, USA