VSMask: Defending Against Voice Synthesis Attack via Real-Time Predictive Perturbation

Authors: Yuanda Wang, Hanqing Guo, Guangjing Wang, Bocheng Chen, Qiben Yan

Published: 2023-05-09 19:31:58+00:00

AI Summary

VSMask is a real-time voice protection mechanism against voice synthesis attacks. Unlike existing offline methods, it uses a predictive neural network to forecast perturbations for upcoming speech, minimizing latency and enabling protection of live audio streams.

Abstract

Deep learning based voice synthesis technology generates artificial human-like speech, which has been used in deepfake or identity theft attacks. Existing defense mechanisms inject subtle adversarial perturbations into raw speech audio to mislead voice synthesis models. However, optimizing the adversarial perturbation not only consumes substantial computation time, but also requires the entire speech to be available in advance. Therefore, these methods are not suitable for protecting live speech streams, such as voice messages or online meetings. In this paper, we propose VSMask, a real-time protection mechanism against voice synthesis attacks. Unlike offline protection schemes, VSMask leverages a predictive neural network to forecast the most effective perturbation for the upcoming streaming speech. VSMask introduces a universal perturbation tailored to arbitrary speech input, shielding real-time speech in its entirety. To minimize audio distortion within the protected speech, we implement a weight-based perturbation constraint that reduces the perceptibility of the added perturbation. We comprehensively evaluate VSMask's protection performance under different scenarios. The experimental results indicate that VSMask can effectively defend against 3 popular voice synthesis models: none of the synthesized voices could deceive speaker verification models or human ears under VSMask protection. In a physical-world experiment, we demonstrate that VSMask successfully safeguards real-time speech by injecting the perturbation over the air.


Key findings
VSMask effectively defended against three popular voice synthesis models, preventing synthetic voices from deceiving speaker verification systems or human listeners. Real-world experiments demonstrated successful real-time protection, even with over-the-air transmission. The weight-based constraint successfully minimized perceptible audio distortion.
Approach
VSMask employs a predictive neural network to forecast an effective perturbation for each upcoming segment of streaming speech. These perturbations are added in real time, together with a universal perturbation header at the beginning of the speech, to mislead voice synthesis models. A weight-based constraint keeps the added perturbation below perceptible levels, minimizing audio distortion.
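To make the streaming structure concrete, here is a minimal sketch of the pipeline described above. Everything here is illustrative: the frame size, budget `EPS`, the per-sample weights, and the `predict_perturbation` stand-in (a toy function replacing VSMask's actual predictive neural network) are all hypothetical, not the authors' implementation.

```python
import numpy as np

FRAME = 256   # samples per streaming frame (hypothetical)
EPS = 0.01    # global perturbation budget (hypothetical)

def predict_perturbation(past_frame, weights):
    # Toy stand-in for the predictive network: derive the perturbation
    # for the *next* frame using only already-observed audio, so no
    # look-ahead latency is incurred.
    raw = np.tanh(past_frame[::-1] * 3.0)  # placeholder "prediction"
    # Weight-based constraint: a per-sample weight (<= 1) scales the
    # clipped perturbation to reduce its perceptibility.
    return np.clip(raw, -EPS, EPS) * weights

def protect_stream(frames):
    """Protect a live stream: a universal header perturbation covers the
    first frame (for which no past audio exists), then each subsequent
    frame receives a perturbation predicted from the previous frame."""
    header = np.full(FRAME, EPS * 0.5)       # universal perturbation header
    weights = np.linspace(0.2, 1.0, FRAME)   # illustrative weighting
    out = [frames[0] + header]
    for prev, cur in zip(frames, frames[1:]):
        out.append(cur + predict_perturbation(prev, weights))
    return out
```

The key design point this sketch captures is causality: each frame's perturbation depends only on audio that has already been heard, which is what lets the scheme run on live streams where offline optimization cannot.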
Datasets
CSTR VCTK Corpus, LibriSpeech
Model(s)
AdaIN-VC, AutoVC, SV2TTS (for voice synthesis); SpeechBrain (for speaker verification)
Author countries
USA