Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification


Authors: Qing Wang, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie

Published: 2023-05-30 13:20:31+00:00

AI Summary

This paper proposes a timbre-reserved adversarial attack for speaker identification (SID) that generates fake audio while preserving the target speaker's timbre, even in black-box settings. This is achieved using a pseudo-Siamese network to learn from a black-box SID model, constraining both intrinsic and structural similarity, and incorporating adversarial constraints during voice conversion model training.

Abstract

In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) that not only exploits the weakness of the SID model but also preserves the timbre of the target speaker in a black-box attack setting. Specifically, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pseudo-Siamese network architecture to learn from the black-box SID model, constraining both intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss is to learn an intrinsic invariance, while the structural similarity loss is to ensure that the substitute SID model shares a similar decision boundary with the fixed black-box SID model. The substitute model can be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that the attack success rate of our proposed approach reaches up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and that the generated audio can deceive both human beings and machines.

Key findings

The proposed attack achieved success rates of up to 60.58% and 55.38% in white-box and black-box scenarios, respectively. The generated fake audio successfully deceived both human listeners and machine-based SID models, maintaining high audio quality.
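
The summary does not spell out how the success rate is computed. A minimal sketch, assuming the common definition (the fraction of fake utterances that the SID model assigns to the intended target speaker), might look like the following; the function name and the argument layout are hypothetical.

```python
def attack_success_rate(predicted_ids, target_ids):
    """Fraction of fake utterances identified as their intended target speaker.

    Assumption: an attack counts as successful when the SID model's top-1
    prediction equals the target speaker ID. The paper may define it differently.
    """
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predicted_ids and target_ids must have the same length")
    if not predicted_ids:
        raise ValueError("at least one utterance is required")
    hits = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return hits / len(predicted_ids)


# Hypothetical example: three of four fake utterances fool the SID model.
rate = attack_success_rate(["spk1", "spk2", "spk9", "spk4"],
                           ["spk1", "spk2", "spk3", "spk4"])
```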

Approach

The approach uses a voice conversion (VC) model with an adversarial constraint to generate timbre-preserved fake audio. A pseudo-Siamese network learns from a black-box SID model using intrinsic and structural similarity losses, creating a substitute model to generate the adversarial audio.
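
To make the two training objectives more concrete, here is a rough sketch in plain Python. It is not the authors' formulation: the summary does not give the loss definitions, so the sketch assumes that the black-box model returns a per-speaker score vector, that intrinsic similarity is measured as cosine distance between the two branches' score vectors, that structural similarity is measured as the KL divergence between their softmax posteriors, and that the adversarial constraint is a cross-entropy term pushing the SID prediction for the converted audio toward the target speaker. All function names and the weights `alpha`, `beta`, and `lam` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def substitute_loss(sub_scores, bb_scores, alpha=1.0, beta=1.0):
    """Assumed pseudo-Siamese objective for training the substitute SID model.

    sub_scores: scores from the trainable substitute branch.
    bb_scores:  scores from the fixed black-box branch (treated as constant).
    """
    intrinsic = cosine_distance(sub_scores, bb_scores)                    # assumed form
    structural = kl_divergence(softmax(bb_scores), softmax(sub_scores))  # assumed form
    return alpha * intrinsic + beta * structural


def vc_loss_with_adversarial_constraint(recon_loss, sid_scores_on_fake,
                                        target_index, lam=1.0):
    """Assumed VC objective: reconstruction loss plus a cross-entropy term that
    rewards the SID model for predicting the target speaker on the fake audio."""
    posterior = softmax(sid_scores_on_fake)
    adversarial = -math.log(posterior[target_index])
    return recon_loss + lam * adversarial


# When both branches agree exactly, the substitute loss is (numerically) zero.
matched = substitute_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = substitute_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In a real training loop these terms would be computed on tensors with an autodiff framework so that gradients reach the substitute model and the voice conversion model; the scalar version above only illustrates how the terms combine.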

Datasets

AISHELL-3 (for training), AISHELL-1 test set (for SID model evaluation), Audio Deepfake Detection (ADD) challenge dataset Track 3.1 (for attack evaluation).

Model(s)

Pseudo-Siamese network, voice conversion model (based on FastSpeech2 and HiFi-GAN vocoder with Conformer blocks), ECAPA-TDNN (black-box SID model), TTS model (based on FastSpeech and DelightfulTTS 2).

Author countries

China
