Deep Generative Variational Autoencoding for Replay Spoof Detection in Automatic Speaker Verification

Authors: Bhusan Chettri, Tomi Kinnunen, Emmanouil Benetos

Published: 2020-03-21 00:56:05+00:00

AI Summary

This paper proposes using variational autoencoders (VAEs) as a backend for replay attack detection in automatic speaker verification. Three VAE models are explored, with the conditional VAE (C-VAE) showing significant improvements over two independently trained VAEs and a Gaussian mixture model (GMM) baseline, achieving a 9-10% absolute improvement in both EER and t-DCF on the ASVspoof 2019 dataset.

Abstract

Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount - yet difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning for classification. The potential benefits include the possibility to sample new data, and to obtain insights into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train two VAEs independently - one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training two separate VAEs for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals - the absolute difference between the original input and its reconstruction - as features for spoofing detection.


Key findings
The C-VAE significantly outperforms the baseline GMM and the naive two-VAE approach on both datasets. On the ASVspoof 2019 dataset, the C-VAE achieves a 9-10% absolute improvement in both EER and t-DCF. Using VAE residuals as input features for a separate CNN classifier also shows promising results.
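The residual-feature idea can be illustrated with a minimal sketch: given an input feature matrix and its VAE reconstruction, the element-wise absolute difference becomes the input to a downstream classifier. The function name and the stand-in reconstruction below are illustrative, not from the paper's code.

```python
import numpy as np

def vae_residual_features(x, x_hat):
    """Element-wise absolute difference between the input and its VAE
    reconstruction, used as features for a downstream CNN classifier."""
    return np.abs(x - x_hat)

# toy example: a CQCC-like feature matrix (coefficients x frames)
rng = np.random.default_rng(0)
x = rng.normal(size=(90, 100))
# stand-in for a VAE reconstruction (a real model would produce x_hat)
x_hat = x + rng.normal(scale=0.1, size=x.shape)
residual = vae_residual_features(x, x_hat)
print(residual.shape)  # (90, 100)
```

The residual has the same shape as the input, so it can be fed to a CNN exactly like the original feature map.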
Approach
The authors propose using variational autoencoders (VAEs), specifically a conditional VAE (C-VAE), to model the distributions of genuine and spoofed speech. The C-VAE is trained with a one-hot class label as conditional input, which encourages discriminative structure in the latent space. At test time, the difference between the C-VAE's scores under the genuine and spoof conditions is used as the detection score.
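The conditional scoring rule can be sketched as follows: evaluate the model's evidence lower bound (ELBO) for the same input under each class label and take the difference, so higher scores indicate genuine speech. The `toy_elbo` stand-in below is a hypothetical Gaussian log-likelihood used only to make the sketch runnable; the paper's actual ELBO comes from the trained C-VAE's encoder and decoder.

```python
import numpy as np

GENUINE = np.array([1.0, 0.0])  # one-hot class labels fed to the C-VAE
SPOOF = np.array([0.0, 1.0])

def cvae_score(x, elbo):
    """Detection score: ELBO of x conditioned on the genuine label minus
    ELBO conditioned on the spoof label (higher -> more likely genuine)."""
    return elbo(x, GENUINE) - elbo(x, SPOOF)

def toy_elbo(x, label):
    # illustrative stand-in: a Gaussian log-likelihood whose mean
    # depends on the conditioning label (not the paper's model)
    mean = 1.0 if label[0] == 1.0 else -1.0
    return float(-0.5 * np.sum((x - mean) ** 2))

x = np.full(4, 1.0)                  # sample lying near the "genuine" mode
print(cvae_score(x, toy_elbo) > 0)   # True
```

The same scoring template applies to the two-separate-VAE setup: replace the two conditional ELBO calls with the ELBOs of the class-specific models.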
Datasets
ASVspoof 2017 and ASVspoof 2019 physical access subtask datasets
Model(s)
Variational Autoencoders (VAEs), Conditional VAEs (C-VAEs), Gaussian Mixture Models (GMMs), Convolutional Neural Networks (CNNs)
Author countries
Finland, United Kingdom