Transforming acoustic characteristics to deceive playback spoofing countermeasures of speaker verification systems

Authors: Fuming Fang, Junichi Yamagishi, Isao Echizen, Md Sahidullah, Tomi Kinnunen

Published: 2018-09-12 06:45:42+00:00

AI Summary

This paper investigates a novel playback spoofing attack on automatic speaker verification (ASV) systems in which stolen speech is enhanced with a speech enhancement generative adversarial network (SEGAN) before being played back. The attack significantly increases equal error rates for existing countermeasures, exposing a vulnerability in current playback detection methods.

Abstract

Automatic speaker verification (ASV) systems use a playback detector to filter out playback attacks and ensure verification reliability. Since current playback detection models are almost always trained on genuine and played-back speech, it may be possible to degrade their performance by transforming the acoustic characteristics of played-back speech to be closer to those of genuine speech. One way to do this is to enhance speech stolen from the target speaker before playback. We tested the effectiveness of a playback attack based on this method, using the speech enhancement generative adversarial network (SEGAN) to transform acoustic characteristics. Experimental results showed that this enhanced stolen speech significantly increases the equal error rates of both the baseline countermeasure used in the ASVspoof 2017 challenge and a light convolutional neural network-based method. The results also showed that it degrades the performance of a Gaussian mixture model-universal background model (GMM-UBM)-based ASV system. This type of attack is thus an urgent problem needing to be solved.


Key findings
The enhanced stolen speech attack significantly increased equal error rates for both the baseline and an advanced (LCNN-based) playback detection countermeasure. The attack also degraded the performance of a GMM-UBM-based speaker verification system, revealing a critical vulnerability in current systems. The effectiveness of the attack varied with the quality of the loudspeaker and recording devices used.
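The equal error rate (EER) cited above is the operating point at which the false-acceptance rate (spoofed trials accepted) equals the false-rejection rate (genuine trials rejected). A minimal sketch of how it can be computed from detection scores (illustrative only, not the ASVspoof challenge's official scoring tool):

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: sweep thresholds over all observed scores and
    return the (FAR + FRR) / 2 value where the two rates are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofed trials accepted
        frr = np.mean(genuine_scores < t)   # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# With well-separated scores the EER is 0; overlap pushes it up.
genuine = np.array([0.9, 0.8, 0.3])
spoofed = np.array([0.1, 0.2, 0.7])
print(compute_eer(genuine, spoofed))  # one of three trials on each side errs
```

An attack that makes spoofed scores overlap more with genuine ones, as the enhanced-playback attack does, directly raises this value.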
Approach
The authors enhance speech stolen from the target speaker using SEGAN so that, once played back, its acoustic characteristics resemble those of genuine speech, masking its playback nature. The enhanced speech is then used in a playback attack against speaker verification systems and their countermeasures.
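SEGAN is trained with a least-squares adversarial objective plus an L1 term that pulls the enhanced waveform toward a clean reference. A minimal sketch of those two losses on dummy arrays, assuming the standard SEGAN formulation (the function name and the λ = 100 weight are illustrative, not taken from this paper):

```python
import numpy as np

def segan_losses(d_fake, d_real, enhanced, clean, lam=100.0):
    """Least-squares GAN losses with an L1 reconstruction term (SEGAN-style).

    d_fake / d_real: discriminator outputs on enhanced vs. clean speech.
    enhanced / clean: generator output waveform and clean reference.
    """
    # Generator: push discriminator output on enhanced speech toward 1,
    # while keeping the waveform close to the clean target (L1).
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2) \
             + lam * np.mean(np.abs(enhanced - clean))
    # Discriminator: score clean speech as 1 and enhanced speech as 0.
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    return g_loss, d_loss

# A perfectly fooled discriminator and perfect reconstruction give
# zero generator loss but a penalized discriminator.
g, d = segan_losses(np.array([1.0]), np.array([1.0]), np.zeros(4), np.zeros(4))
print(g, d)
```

In the attack, only the trained generator is used at playback time: it maps the stolen (possibly degraded) recording to an "enhanced" waveform before it is replayed to the ASV system.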
Datasets
ASVspoof 2017 database (version 2) derived from the RedDots corpus, TIMIT and RSR2015 corpora, VCTK corpus, DR-VCTK corpus, and N-VCTK corpus.
Model(s)
Speech Enhancement Generative Adversarial Network (SEGAN), Gaussian Mixture Model (GMM), Light Convolutional Neural Network (LCNN), and Gaussian Mixture Model-Universal Background Model (GMM-UBM).
Author countries
Japan, UK, France, Finland