Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion

Authors: Xiaohai Tian, Rohan Kumar Das, Haizhou Li

Published: 2019-09-17 08:54:17+00:00

Comment: 6 pages, 3 figures, This paper is submitted to ICASSP 2020

AI Summary

This paper proposes a black-box adversarial framework that enhances voice conversion (VC) attacks on Automatic Speaker Verification (ASV) systems. It uses the ASV system's output scores as feedback to a VC system, optimizing the converted speech to be more deceptive without needing internal ASV knowledge. Experiments demonstrate that this feedback-controlled VC significantly boosts impostor ASV scores while maintaining natural speech quality.

Abstract

Automatic speaker verification (ASV) systems in practice are highly vulnerable to spoofing attacks. The latest voice conversion technologies are able to produce perceptually natural-sounding speech that mimics any target speaker. However, perceptual closeness to a speaker's identity may not be enough to deceive an ASV system. In this work, we propose a framework that uses the output scores of an ASV system as feedback to a voice conversion system. The attacker framework is a black-box adversary that steals one's voice identity, because it requires no knowledge of the ASV system other than its outputs. Experiments conducted on the ASVspoof 2019 database confirm that the proposed feedback-controlled voice conversion framework produces adversarial samples that are more deceptive than straightforward voice conversion, thereby boosting the impostor ASV scores. Further, perceptual evaluation studies reveal that the feedback-controlled conversion does not degrade voice quality relative to the baseline system.


Key findings

The proposed feedback-controlled VC system (PPG-VC-FC) produces more deceptive adversarial samples than the baseline VC system: impostor ASV scores rise, and the score distribution shifts toward that of genuine target speakers. Perceptual evaluation studies indicate that the feedback control does not degrade the voice quality or speaker similarity of the converted speech, so the attacks remain perceptually natural while being more effective against the ASV system.

Approach

The authors propose a feedback-controlled voice conversion (VC) system in which the output scores of a black-box Automatic Speaker Verification (ASV) system are used as feedback. The VC network, which is based on phonetic posteriorgrams (PPGs), is trained with a combined loss function that includes both a standard VC mean squared error (MSE) loss and a loss derived from the ASV output score, so that training also pushes the converted speech toward higher ASV scores.
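
As a rough illustration of the combined objective described above, the following Python sketch adds an ASV-score term to a frame-level MSE loss. It is not the authors' implementation: the names `mse_loss`, `combined_loss`, and `asv_weight`, the sign convention (a higher ASV score lowers the loss), and the weighting scheme are all assumptions made for illustration, since this summary does not give the exact form of the ASV-derived term or how the feedback is propagated into training.

```python
from typing import Sequence

Frames = Sequence[Sequence[float]]


def mse_loss(converted: Frames, target: Frames) -> float:
    """Mean squared error over all feature values of two aligned frame sequences."""
    if len(converted) != len(target):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for conv_frame, tgt_frame in zip(converted, target):
        if len(conv_frame) != len(tgt_frame):
            raise ValueError("frames must have the same dimensionality")
        for c, t in zip(conv_frame, tgt_frame):
            total += (c - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("sequences must not be empty")
    return total / count


def combined_loss(converted: Frames, target: Frames,
                  asv_score: float, asv_weight: float = 0.1) -> float:
    """Hypothetical combined objective: VC MSE loss minus a weighted ASV score.

    Assumption: a higher ASV score means the converted speech is closer to
    the target speaker, so the score is subtracted to reward higher values.
    """
    return mse_loss(converted, target) - asv_weight * asv_score


# Toy features: two frames of two dimensions each.
converted = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 2.0], [3.0, 6.0]]
print(combined_loss(converted, target, asv_score=2.0, asv_weight=0.5))  # 0.0
```

In this sketch, `asv_score` stands in for whatever score the black-box ASV system returns for the converted utterance; only that output is used, which matches the black-box setting described in the abstract.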

Datasets

ASVspoof 2019 logical access subset for the VC and ASV attack experiments; Switchboard and the NIST SRE 2006-2012 corpora for training the i-vector extractor.

Model(s)

The voice conversion system uses a network with two Bidirectional Long Short-Term Memory (BLSTM) layers. The Automatic Speaker Verification (ASV) system, which is the target of the attack, is an i-vector based system implemented using Kaldi.

Author countries

Singapore, China