Subband modeling for spoofing detection in automatic speaker verification

Authors: Bhusan Chettri, Tomi Kinnunen, Emmanouil Benetos

Published: 2020-04-04 12:49:21+00:00

AI Summary

This paper investigates the impact of different frequency subbands on replay spoofing detection in automatic speaker verification. A joint subband modeling framework using multiple CNNs, each trained on a different subband, is proposed and shown to outperform full-band models on the ASVspoof 2017 dataset. However, this improvement did not generalize to the ASVspoof 2019 PA dataset.

Abstract

Spectrograms, time-frequency representations of audio signals, have found widespread use in neural network-based spoofing detection. While deep models are typically trained on the full-band spectrum of the signal, we argue that not all frequency bands are useful for these tasks. In this paper, we systematically investigate the impact of different subbands and their importance for replay spoofing detection on two benchmark datasets: ASVspoof 2017 v2.0 and ASVspoof 2019 PA. We propose a joint subband modelling framework that employs n different sub-networks to learn subband-specific features. These are later combined and passed to a classifier, and the weights of the whole network are updated during training. Our findings on the ASVspoof 2017 dataset suggest that the most discriminative information appears to be in the first and the last 1 kHz frequency bands, and the joint model trained on these two subbands shows the best performance, outperforming the baselines by a large margin. However, these findings do not generalise to the ASVspoof 2019 PA dataset. This suggests that the datasets available for training these models do not reflect real-world replay conditions, indicating a need for the careful design of datasets for training replay spoofing countermeasures.
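To make the subband idea concrete, here is a minimal sketch of how a magnitude spectrogram could be sliced into the lowest and highest 1 kHz bands highlighted in the abstract. The 16 kHz sampling rate, FFT size, and the use of librosa are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code): slice a magnitude spectrogram
# into the lowest and highest 1 kHz subbands. Sampling rate and FFT size are
# assumptions chosen for illustration.
import numpy as np
import librosa

def subband_slices(wav_path, sr=16000, n_fft=512, hop_length=256):
    y, _ = librosa.load(wav_path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))  # (freq_bins, frames)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

    low_band = spec[freqs <= 1000.0, :]             # first 1 kHz
    high_band = spec[freqs >= sr / 2 - 1000.0, :]   # last 1 kHz (7-8 kHz at 16 kHz)
    return low_band, high_band
```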


Key findings
The joint subband model significantly outperformed baseline full-band CNN and GMM models on the ASVspoof 2017 dataset, with the most discriminative information found in the lowest and highest 1 kHz frequency bands. However, this performance did not generalize to the ASVspoof 2019 PA dataset, highlighting the need for more realistic training datasets.
Approach
The authors propose a joint subband modeling framework that uses multiple CNNs, each trained on a non-overlapping frequency subband of the audio spectrogram. The outputs of these sub-networks are concatenated and fed into a classifier. The entire network is trained jointly.
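As a rough illustration of the joint training setup described above, the following is a minimal PyTorch sketch: one small CNN per subband, embeddings concatenated and passed to a shared classifier, with all weights updated jointly. The layer sizes, embedding dimension, and two-subband configuration are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch of joint subband modelling: one CNN sub-network per subband,
# concatenated embeddings, and a shared classifier trained end to end.
# Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class SubbandCNN(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool to a fixed-size feature vector
        )
        self.proj = nn.Linear(32, emb_dim)

    def forward(self, x):              # x: (batch, 1, freq_bins, frames)
        h = self.features(x).flatten(1)
        return self.proj(h)

class JointSubbandModel(nn.Module):
    def __init__(self, n_subbands=2, emb_dim=64):
        super().__init__()
        self.subnets = nn.ModuleList([SubbandCNN(emb_dim) for _ in range(n_subbands)])
        self.classifier = nn.Linear(n_subbands * emb_dim, 2)  # genuine vs. spoof

    def forward(self, subband_specs):  # list of per-subband spectrogram tensors
        embeddings = [net(spec) for net, spec in zip(self.subnets, subband_specs)]
        return self.classifier(torch.cat(embeddings, dim=1))
```

Each entry of subband_specs would be a batch of spectrogram slices for one subband, e.g. the low and high 1 kHz bands produced by the slicing sketch shown earlier.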
Datasets
ASVspoof 2017 v2.0, ASVspoof 2019 PA, ASVspoof 2019 real PA
Model(s)
Convolutional Neural Networks (CNNs), Gaussian Mixture Models (GMMs), Feedforward Neural Network (FFNN)
Author countries
Finland, United Kingdom