Speech Replay Detection with x-Vector Attack Embeddings and Spectral Features

Authors: Jennifer Williams, Joanna Rownicka

Published: 2019-09-23 12:27:04+00:00

AI Summary

This paper presents a speech replay detection system submitted to the ASVspoof 2019 challenge. The system feeds a convolutional neural network (CNN) with two kinds of input: x-vector attack embeddings, which jointly model environment and attack types, and sub-band spectral centroid magnitude coefficients (SCMCs). The approach outperforms both challenge baselines as measured by the tandem detection cost function (tDCF) and equal error rate (EER).

Abstract

We present our system submission to the ASVspoof 2019 Challenge Physical Access (PA) task. The objective for this challenge was to develop a countermeasure that identifies speech audio as either bona fide or intercepted and replayed. The target prediction was a value indicating that a speech segment was bona fide (positive values) or spoofed (negative values). Our system used convolutional neural networks (CNNs) and a representation of the speech audio that combined x-vector attack embeddings with signal processing features. The x-vector attack embeddings were created from mel-frequency cepstral coefficients (MFCCs) using a time-delay neural network (TDNN). These embeddings jointly modeled 27 different environments and 9 types of attacks from the labeled data. We also used sub-band spectral centroid magnitude coefficients (SCMCs) as features. We included an additive Gaussian noise layer during training as a way to augment the data to make our system more robust to previously unseen attack examples. We report system performance using the tandem detection cost function (tDCF) and equal error rate (EER). Our approach performed better than both of the challenge baselines. Our technique suggests that our x-vector attack embeddings can help regularize the CNN predictions even when environments or attacks are more challenging.
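As an illustration of the EER metric mentioned above, the following is a minimal sketch that sweeps a decision threshold over the observed scores and reports the point where the false acceptance and false rejection rates are closest. It assumes the abstract's convention that higher scores mean "more bona fide"; the function name and the toy scores are hypothetical and are not taken from the paper. The tDCF is not sketched here because it additionally depends on the error rates of an accompanying speaker verification system and on challenge-defined cost parameters.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumed convention: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
bona = [0.9, 0.7, 0.4, -0.2]
spoof = [-0.8, -0.5, 0.1, 0.5]
eer = equal_error_rate(bona, spoof)
```

With a finite set of scores the two error rates may never coincide exactly, so this sketch averages them at the closest point; evaluation toolkits often interpolate instead.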


Key findings
The combined approach of x-vector attack embeddings and SCMC features outperforms both challenge baselines in terms of tDCF and EER. The x-vector embeddings, while not individually strong detectors, improve performance when combined with signal processing features. The additive Gaussian noise layer improves robustness.
Approach
The system uses a CNN for regression, with the target converted to numerical values (-1 for spoofed, +1 for bona fide). Input features are a combination of x-vector embeddings (trained to differentiate between environment and attack types) and SCMC features. An additive Gaussian noise layer is used for data augmentation.
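The target encoding and feature combination described above can be sketched as follows. This is a minimal illustration under stated assumptions: the summary does not say how the two feature types are combined, so simple concatenation is assumed here, and the label strings, function names, and the zero decision threshold are all hypothetical.

```python
# Assumed label strings; the regression targets (-1 / +1) follow the summary.
LABEL_TO_TARGET = {"bonafide": 1.0, "spoof": -1.0}


def build_input(xvector, scmc):
    """Combine an x-vector embedding with SCMC features.

    Assumption: the combination is a plain concatenation of the two vectors.
    """
    return list(xvector) + list(scmc)


def decide(score):
    """Map a regression output to a label by its sign (assumed threshold of 0)."""
    return "bonafide" if score > 0 else "spoof"


# Toy example with made-up feature values.
features = build_input([0.12, -0.40], [0.75, 0.33, 0.08])
target = LABEL_TO_TARGET["spoof"]
```

Treating detection as regression onto -1 and +1 means the sign of the output carries the decision, while its magnitude can serve as a score for threshold-based metrics.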
Datasets
ASVspoof 2019 Challenge Physical Access (PA) task dataset; 54,000 training instances and 29,700 development instances.
Model(s)
Convolutional Neural Network (CNN) with 3 Conv1D layers, max pooling, batch normalization, an additive Gaussian noise layer, and a fully connected output layer with a hyperbolic tangent activation function.
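The additive Gaussian noise layer listed above can be illustrated with a small framework-free sketch. It assumes the common behavior of such layers: zero-mean noise is added only during training, and the input passes through unchanged at inference. The function name and the standard deviation of 0.1 are hypothetical; the summary does not state the value the authors used.

```python
import random


def gaussian_noise_layer(features, stddev=0.1, training=True, rng=None):
    """Add zero-mean Gaussian noise during training; pass through otherwise.

    stddev=0.1 is an illustrative value, not one reported in the source.
    """
    if not training:
        # At inference time the layer is assumed to be an identity mapping.
        return list(features)
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, stddev) for x in features]


# Toy activations (made up); a seeded generator keeps the example repeatable.
activations = [0.5, -0.2, 0.9]
noisy = gaussian_noise_layer(activations, rng=random.Random(0))
clean = gaussian_noise_layer(activations, training=False)
```

Because each training pass sees a slightly different perturbed copy of the same input, the layer acts as on-the-fly data augmentation, which matches the robustness motivation given in the abstract.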
Author countries
United Kingdom