Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

Authors: Qing Wang, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie

Published: 2023-05-30 13:20:31+00:00

Comment: 5 pages

AI Summary

This study proposes a timbre-reserved adversarial attack method for black-box speaker identification (SID) systems. It generates fake audio by integrating an adversarial constraint into a voice conversion (VC) model to preserve timbre, while a pseudo-Siamese network trains a substitute SID model to mimic the black-box target. This approach allows for effective attacks that deceive both machines and humans by exploiting SID vulnerabilities while maintaining high audio quality.

Abstract

In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) to not only exploit the weakness of the SID model but also preserve the timbre of the target speaker in a black-box attack setting. Specifically, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pseudo-Siamese network architecture to learn from the black-box SID model, constraining both intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss is used to learn an intrinsic invariance, while the structural similarity loss ensures that the substitute SID model shares a similar decision boundary with the fixed black-box SID model. The substitute model can then be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that the attack success rate of our proposed approach reaches up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and that the generated audio can deceive both human beings and machines.


Key findings
The proposed method achieved attack success rates of up to 60.58% in white-box and 55.38% in black-box scenarios on the ADD challenge dataset, significantly outperforming a vanilla VC baseline (29.60%). The substitute SID classifier accurately approximated the black-box model (91.44% agreement). Furthermore, the generated timbre-reserved fake audio maintained superior quality (higher o-MOS) compared to methods using direct adversarial perturbations, demonstrating the ability to deceive both machines and humans.
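The two percentages reported above (attack success rate and substitute/black-box agreement) are both simple ratios over per-utterance decisions. The following is a minimal sketch, assuming that an attack counts as successful when the SID model assigns the fake utterance to the intended target speaker, and that agreement means the substitute and the black-box model predict the same speaker label. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def attack_success_rate(predicted_speakers, target_speakers):
    """Fraction of fake utterances that the SID model assigns to the target speaker.

    Assumption: "success" means predicted label == intended target label.
    """
    if len(predicted_speakers) != len(target_speakers):
        raise ValueError("prediction and target lists must have the same length")
    if not predicted_speakers:
        raise ValueError("at least one utterance is required")
    hits = sum(p == t for p, t in zip(predicted_speakers, target_speakers))
    return hits / len(predicted_speakers)


def agreement_rate(substitute_preds, blackbox_preds):
    """Fraction of utterances on which the substitute and black-box labels match."""
    if len(substitute_preds) != len(blackbox_preds):
        raise ValueError("both prediction lists must have the same length")
    if not substitute_preds:
        raise ValueError("at least one utterance is required")
    same = sum(s == b for s, b in zip(substitute_preds, blackbox_preds))
    return same / len(substitute_preds)


# Toy labels for illustration only (not data from the paper).
asr = attack_success_rate(["spk1", "spk2", "spk3", "spk1"],
                          ["spk1", "spk2", "spk1", "spk1"])
agree = agreement_rate(["spk1", "spk2", "spk3"], ["spk1", "spk2", "spk2"])
print(f"attack success rate: {asr:.2%}, agreement: {agree:.2%}")
```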
Approach
The approach has two main steps. First, timbre-reserved fake audio is generated by adding an adversarial constraint during the training of a voice conversion (VC) model. Second, a pseudo-Siamese network learns a substitute speaker identification (SID) model from a black-box SID system by constraining both intrinsic and structural similarity. The substitute model then serves as a proxy that guides the generation of timbre-reserved adversarial audio for attacking the original black-box SID model.
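To make the second step more concrete, here is a rough sketch of how a substitute model's training objective could combine the two similarity terms. The summary does not give the loss formulas, so everything below is an assumption: the intrinsic term is stood in for by one minus the cosine similarity between the two models' output posteriors, the structural term by a cross-entropy against the black-box model's predicted label, and `alpha` and `beta` are hypothetical weights. It also assumes the black-box system exposes posterior scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intrinsic_similarity_loss(sub_probs, bb_probs):
    """Assumed stand-in: 1 - cosine similarity between output posteriors."""
    dot = sum(a * b for a, b in zip(sub_probs, bb_probs))
    norm_sub = math.sqrt(sum(a * a for a in sub_probs))
    norm_bb = math.sqrt(sum(b * b for b in bb_probs))
    if norm_sub == 0.0 or norm_bb == 0.0:
        raise ValueError("posterior vectors must be non-zero")
    return 1.0 - dot / (norm_sub * norm_bb)


def structural_similarity_loss(sub_probs, bb_label, eps=1e-12):
    """Assumed stand-in: cross-entropy against the black-box decision."""
    return -math.log(sub_probs[bb_label] + eps)


def substitute_loss(sub_logits, bb_probs, alpha=1.0, beta=1.0):
    """Weighted sum of both terms; alpha and beta are hypothetical weights."""
    sub_probs = softmax(sub_logits)
    # The black-box decision is taken as the argmax of its posterior.
    bb_label = max(range(len(bb_probs)), key=lambda i: bb_probs[i])
    return (alpha * intrinsic_similarity_loss(sub_probs, bb_probs)
            + beta * structural_similarity_loss(sub_probs, bb_label))


# Toy values: the black box favours speaker 0.
bb_probs = [0.9, 0.05, 0.05]
print(substitute_loss([5.0, 0.0, 0.0], bb_probs))  # substitute agrees
print(substitute_loss([0.0, 5.0, 0.0], bb_probs))  # substitute disagrees
```

In this toy setup the loss is smaller when the substitute's prediction matches the black-box decision, which is the behavior a substitute-training objective needs; the actual loss definitions should be taken from the paper itself.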
Datasets
AISHELL-3 (training the SID model, the VC model, and the pseudo-Siamese network), AISHELL-1 (SID model evaluation), and the Audio Deepfake Detection (ADD) challenge dataset, Track 3.1 (adversarial attack evaluation).
Model(s)
Speaker identification uses ECAPA-TDNN as the black-box SID model and a Conformer-based model, similar to x-vector, as the substitute speaker classifier. The text-to-speech (TTS) component uses a 6-layer Conformer encoder-decoder (similar to DelightfulTTS 2). The voice conversion (VC) model uses a non-autoregressive 8-layer Transformer encoder-decoder (similar to FastSpeech2) with a HiFi-GAN vocoder and relies on a pre-trained ASR model for linguistic information.
Author countries
China