SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis

Authors: Zhisheng Zhang, Derui Wang, Qianyi Yang, Pengyang Huang, Junhan Pu, Yuxin Cao, Kai Ye, Jie Hao, Yixian Yang

Published: 2025-04-14 03:21:23+00:00

AI Summary

SafeSpeech is a proactive voice protection framework that embeds imperceptible perturbations into audio before uploading to prevent high-quality speech synthesis. It uses a surrogate model and a novel Speech PErturbative Concealment (SPEC) technique to generate universally applicable perturbations robust against adaptive adversaries.
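The protect-before-upload idea can be pictured as a projected-gradient-style loop over an L-infinity-bounded additive perturbation. Below is a minimal PyTorch sketch of that loop; `loss_fn`, `epsilon`, `steps`, and `alpha` are illustrative placeholders and not the paper's exact algorithm or settings.

```python
import torch

def protect(clean_wav: torch.Tensor, loss_fn, epsilon: float = 0.002,
            steps: int = 100, alpha: float = None) -> torch.Tensor:
    """Craft an L-infinity-bounded perturbation delta so that clean_wav + delta
    minimizes the protection objective encoded by loss_fn (e.g. a surrogate
    TTS model's pivotal loss). All hyperparameters here are illustrative."""
    alpha = alpha if alpha is not None else epsilon / 10
    delta = torch.zeros_like(clean_wav, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(clean_wav + delta)       # scalar protection objective
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient descent step
            delta.clamp_(-epsilon, epsilon)     # keep the perturbation imperceptible
        delta.grad.zero_()
    # Protected waveform, ready to be shared instead of the original audio.
    return (clean_wav + delta).detach().clamp(-1.0, 1.0)
```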

Abstract

Speech synthesis technology has brought great convenience, while the widespread use of realistic deepfake audio has triggered hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation (e.g., telecom fraud). However, existing defense methods cannot effectively prevent deepfake exploitation and are vulnerable to robust training techniques. Therefore, a more effective and robust data protection method is urgently needed. In response, we propose a defensive framework, SafeSpeech, which protects users' audio before uploading by embedding imperceptible perturbations into the original speech to prevent high-quality synthetic speech. In SafeSpeech, we devise a robust and universal proactive protection technique, Speech PErturbative Concealment (SPEC), that leverages a surrogate model to generate universally applicable perturbations for generative synthesis models. Moreover, we optimize the human perceptibility of the embedded perturbation in both the time and frequency domains. To evaluate our method comprehensively, we conduct extensive experiments across advanced models and datasets, both subjectively and objectively. Our experimental results demonstrate that SafeSpeech achieves state-of-the-art (SOTA) voice protection effectiveness and transferability and is highly robust against advanced adaptive adversaries. Moreover, SafeSpeech has real-time capability in real-world tests. The source code is available at https://github.com/wxzyd123/SafeSpeech.


Key findings
SafeSpeech achieves state-of-the-art voice protection effectiveness and transferability across various TTS models. It is highly robust against advanced adaptive adversaries, including perturbation removal, data augmentation, and model recovery techniques. Real-world tests demonstrate real-time capability.
Approach
SafeSpeech embeds imperceptible perturbations into audio using a surrogate model to minimize a pivotal objective function focused on mel-spectrogram distance, combined with the SPEC technique, which leverages Kullback-Leibler divergence to make speech synthesized from the protected data sound noise-like. Perceptibility of the perturbation is controlled with STOI- and STFT-based losses covering the time and frequency domains, as sketched below.
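The following sketch shows one plausible way such loss components could be written with PyTorch and torchaudio; the exact formulation, weighting, and noise targets are assumptions, not the authors' implementation, and `synth_wav` would come from a surrogate TTS model. STOI is usually computed with non-differentiable implementations (e.g. pystoi) and used as a perceptual check, so only the STFT-domain term is shown here.

```python
import torch
import torch.nn.functional as F
import torchaudio

# Shared mel-spectrogram front end; parameter values are illustrative.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

def pivotal_mel_loss(synth_wav: torch.Tensor, target_wav: torch.Tensor) -> torch.Tensor:
    """Mel-spectrogram distance between the surrogate model's synthesized speech
    and a chosen target; one plausible form of the pivotal objective."""
    return F.l1_loss(mel(synth_wav), mel(target_wav))

def spec_concealment_loss(synth_wav: torch.Tensor, noise_wav: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing the synthesized mel distribution toward that of a
    Gaussian-noise reference, so speech cloned from protected data sounds noise-like.
    Expects batched waveforms of shape (batch, samples)."""
    p = F.log_softmax(mel(synth_wav).flatten(1), dim=-1)
    q = F.softmax(mel(noise_wav).flatten(1), dim=-1)
    return F.kl_div(p, q, reduction="batchmean")

def stft_perceptual_loss(protected_wav: torch.Tensor, clean_wav: torch.Tensor) -> torch.Tensor:
    """Multi-resolution STFT penalty keeping the perturbed audio close to the
    original in the frequency domain (perceptibility constraint)."""
    loss = 0.0
    for n_fft in (512, 1024, 2048):
        window = torch.hann_window(n_fft, device=clean_wav.device)
        s_p = torch.stft(protected_wav, n_fft, hop_length=n_fft // 4,
                         window=window, return_complex=True).abs()
        s_c = torch.stft(clean_wav, n_fft, hop_length=n_fft // 4,
                         window=window, return_complex=True).abs()
        loss = loss + F.l1_loss(s_p, s_c)
    return loss
```

In practice these terms would be weighted and summed into a single objective passed to an optimization loop such as the `protect` sketch above; the weights are tuning choices not specified here.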
Datasets
LibriTTS and CMU ARCTIC
Model(s)
BERT-VITS2, StyleTTS 2, MB-iSTFT-VITS, VITS, GlowTTS, TorToise-TTS, XTTS, OpenVoice, FishSpeech, and F5-TTS
Author countries
China, Australia, Singapore, Hong Kong