Texture-based Presentation Attack Detection for Automatic Speaker Verification

Authors: Lazaro J. Gonzalez-Soler, Jose Patino, Marta Gomez-Barrero, Massimiliano Todisco, Christoph Busch, Nicholas Evans

Published: 2020-10-08 15:03:29+00:00

AI Summary

This paper proposes a presentation attack detection (PAD) method for automatic speaker verification using texture descriptors applied to speech spectrogram images. A common Fisher vector feature space, based on a generative model, is used to improve the generalizability of PAD solutions, achieving low error rates for both known and unknown attacks.

Abstract

Biometric systems are nowadays employed across a broad range of applications. They provide high security and efficiency and, in many cases, are user friendly. Despite these and other advantages, biometric systems in general and automatic speaker verification (ASV) systems in particular can be vulnerable to attack presentations. The most recent ASVSpoof 2019 competition showed that most forms of attacks can be detected reliably with ensemble classifier-based presentation attack detection (PAD) approaches. These, though, depend fundamentally upon the complementarity of systems in the ensemble. With the motivation to increase the generalisability of PAD solutions, this paper reports our exploration of texture descriptors applied to the analysis of speech spectrogram images. In particular, we propose a common Fisher vector feature space based on a generative model. Experimental results show the soundness of our approach: at most, 16 in 100 bona fide presentations are rejected whereas only one in 100 attack presentations is accepted.


Key findings
The proposed method achieved low error rates, particularly with the combination of BSIF descriptors and constant-Q transform (CQT) spectrograms. The Fisher vector encoding improved generalizability to unknown attacks, outperforming the baseline methods and, in terms of detection error trade-off (DET), the state of the art for some attacks in the ASVSpoof 2019 challenge. As the abstract reports, at most 16 in 100 bona fide presentations are rejected while one in 100 attack presentations is accepted.
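The two error proportions quoted in the abstract (bona fide presentations rejected, attack presentations accepted) correspond to what ISO/IEC 30107-3 calls BPCER and APCER. A minimal sketch of how they could be computed at a fixed threshold follows. The function name, the score convention (higher means "more likely bona fide"), and the toy scores are assumptions made for illustration and are not taken from the paper.

```python
def pad_error_rates(bona_fide_scores, attack_scores, threshold):
    """Return (bpcer, apcer) for a fixed decision threshold.

    Assumed convention (not from the paper): a presentation is accepted
    as bona fide when its score is >= threshold.
    """
    if not bona_fide_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    # Bona fide presentations wrongly rejected as attacks.
    rejected_bona_fide = sum(1 for s in bona_fide_scores if s < threshold)
    # Attack presentations wrongly accepted as bona fide.
    accepted_attacks = sum(1 for s in attack_scores if s >= threshold)
    bpcer = rejected_bona_fide / len(bona_fide_scores)
    apcer = accepted_attacks / len(attack_scores)
    return bpcer, apcer


# Toy scores, made up for illustration only.
bona_fide = [0.9, 0.8, 0.75, 0.4, 0.95]
attacks = [0.1, 0.2, 0.35, 0.6, 0.05]
bpcer, apcer = pad_error_rates(bona_fide, attacks, threshold=0.5)
print(f"BPCER={bpcer:.2f}, APCER={apcer:.2f}")  # BPCER=0.20, APCER=0.20
```

Read this way, the abstract's figures would amount to a BPCER of at most 16% at an APCER of 1%; that mapping is an interpretation of the abstract's wording, not a statement from the paper.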
Approach
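The approach converts speech signals into time-frequency representations (spectrograms), which are then treated as images. Texture descriptors, namely local binary patterns (LBP), multi-block LBP (MB-LBP), local phase quantization (LPQ), and binarized statistical image features (BSIF), are extracted from these images and encoded as Fisher vectors. A support vector machine (SVM) then classifies each Fisher vector as either a bona fide or an attack presentation.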
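To make the texture-descriptor step concrete, here is a minimal sketch of the basic 3x3 LBP operator applied to a small grid standing in for a spectrogram patch. It is the textbook LBP variant, not necessarily the configuration used in the paper (this summary does not give the descriptor parameters), and the function name and toy values are assumptions.

```python
def lbp_codes(image):
    """Basic 3x3 local binary pattern codes for the interior pixels of a 2-D grid.

    `image` is a list of equal-length rows of numbers, for example
    spectrogram magnitudes. Border pixels are skipped.
    """
    # Neighbour offsets (row, col), clockwise from the top-left pixel.
    # The starting point and direction are a convention; any fixed order works.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as large as the centre sets its bit.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


# Toy 4x4 "spectrogram patch", values made up for illustration.
image = [
    [1, 2, 3, 4],
    [2, 9, 1, 5],
    [3, 1, 9, 6],
    [4, 5, 6, 7],
]
codes = lbp_codes(image)
print(codes)  # [[16, 255], [255, 1]]
```

Each code summarises the local pattern of energy around one time-frequency bin; descriptors of this kind are what the encoding stage would take as input.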
Datasets
ASVSpoof 2019 database (Logical Access and Physical Access scenarios)
Model(s)
Support vector machine (SVM) applied to Fisher vector encodings of texture descriptors (LBP, MB-LBP, LPQ, BSIF). A Gaussian mixture model (GMM) serves as the generative model from which the Fisher vectors are derived.
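The Fisher vector encoding named above can be sketched as follows, assuming a diagonal-covariance GMM and the standard gradients with respect to the component means and standard deviations. The GMM parameters below are made up; in practice they would be fitted to training descriptors. The function name is hypothetical, and the power and L2 normalisation steps often applied afterwards are omitted. Nothing here is claimed to match the paper's exact configuration.

```python
import math


def fisher_vector(descriptors, weights, means, variances):
    """Encode a set of D-dim descriptors as a 2*K*D Fisher vector.

    weights[k], means[k][d], variances[k][d] describe a K-component
    diagonal-covariance GMM (assumed to be already trained).
    """
    T, K, D = len(descriptors), len(weights), len(means[0])
    grad_mu = [[0.0] * D for _ in range(K)]
    grad_sigma = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Log of w_k * N(x | mu_k, diag(var_k)) for every component.
        log_p = []
        for k in range(K):
            lp = math.log(weights[k])
            for d in range(D):
                var = variances[k][d]
                lp -= 0.5 * (math.log(2 * math.pi * var)
                             + (x[d] - means[k][d]) ** 2 / var)
            log_p.append(lp)
        # Posterior (soft assignment) of x to each component, computed stably.
        m = max(log_p)
        exp_p = [math.exp(lp - m) for lp in log_p]
        total = sum(exp_p)
        gamma = [e / total for e in exp_p]
        for k in range(K):
            for d in range(D):
                z = (x[d] - means[k][d]) / math.sqrt(variances[k][d])
                grad_mu[k][d] += gamma[k] * z
                grad_sigma[k][d] += gamma[k] * (z * z - 1.0)
    fv = []
    for k in range(K):  # gradients w.r.t. the means
        fv.extend(g / (T * math.sqrt(weights[k])) for g in grad_mu[k])
    for k in range(K):  # gradients w.r.t. the standard deviations
        fv.extend(g / (T * math.sqrt(2.0 * weights[k])) for g in grad_sigma[k])
    return fv


# Toy 2-component GMM over 2-dim descriptors (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
descriptors = [[0.2, -0.1], [2.9, 3.2], [1.5, 1.4]]
fv = fisher_vector(descriptors, weights, means, variances)
print(len(fv))  # 8
```

The output has a fixed length of 2*K*D regardless of how many descriptors a spectrogram yields, which is what lets a single classifier such as an SVM operate on recordings of different durations.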
Author countries
Germany, France