VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect

Authors: Qingyuan Fei, Wenjie Hou, Xuan Hai, Xin Liu

Published: 2025-02-14 17:43:01+00:00

AI Summary

VocalCrypt is a novel active defense method against AI voice cloning that embeds imperceptible pseudo-timbre into audio, preventing voice cloning without compromising audio quality. It significantly improves robustness and real-time performance compared to existing methods.

Abstract

Rapid advances in AI voice cloning, fueled by machine learning, have significantly impacted the text-to-speech (TTS) and voice conversion (VC) fields. While these developments have led to notable progress, they have also raised concerns about the misuse of AI VC technology, which causes economic losses and negative public perceptions. To address this challenge, this study focuses on creating active defense mechanisms against AI VC systems. We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into the audio in a form that is imperceptible to the human ear, thereby forming systematic fragments that prevent voice cloning. This approach protects the voice without compromising its quality. Compared with existing methods such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance, achieving a 500% increase in generation speed while maintaining interference effectiveness. Unlike audio watermarking techniques, which focus on post-detection, our method offers preemptive defense, reducing implementation costs and enhancing feasibility. Extensive experiments using the Zhvoice and VCTK Corpus datasets show that our AI-cloned speech defense system performs excellently in automatic speaker verification (ASV) tests while preserving the integrity of the protected audio.


Key findings
Compared with existing active defenses such as adversarial noise incorporation, VocalCrypt is more robust and better suited to real-time use, achieving a 500% increase in generation speed while maintaining interference effectiveness. It preserves audio quality and remains robust against noise reduction and compression attacks.
Approach
VocalCrypt exploits the masking effect of the human auditory system to embed pseudo-timbre (jamming information) into audio segments. The embedded information is inaudible to listeners but disrupts voice cloning attempts, so the original audio quality is maintained. The embedding uses the Discrete Wavelet Transform (DWT), a filter bank for critical band division, and Quantization Index Modulation (QIM).
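
To make the DWT and QIM building blocks concrete, the following is a minimal generic sketch of QIM embedding in wavelet coefficients. It is not the authors' implementation: the single-level Haar wavelet, the step size, the use of a random bit pattern as the payload, and all function names are assumptions chosen for illustration. The critical band filter bank and the masking-based selection of where to embed are not detailed in this summary and are omitted.

```python
import numpy as np

def haar_dwt(x):
    """Single-level orthonormal Haar DWT (assumed wavelet, for illustration)."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]          # expects an even-length signal
    approx = (even + odd) / np.sqrt(2)
    detail = (even - odd) / np.sqrt(2)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def qim_embed(coeffs, bits, step):
    """Quantize each coefficient onto one of two interleaved lattices.

    Bit 0 uses multiples of `step`; bit 1 uses the same lattice shifted
    by step / 2. The change per coefficient is at most step / 2.
    """
    offsets = np.asarray(bits) * (step / 2)
    return np.round((coeffs - offsets) / step) * step + offsets

def qim_extract(coeffs, step):
    """Recover bits by choosing the nearer of the two lattices."""
    d0 = np.abs(coeffs - np.round(coeffs / step) * step)
    shifted = np.round((coeffs - step / 2) / step) * step + step / 2
    d1 = np.abs(coeffs - shifted)
    return (d1 < d0).astype(int)

# Hypothetical usage: a random signal stands in for an audio segment.
rng = np.random.default_rng(0)
segment = rng.normal(scale=0.1, size=1024)
payload = rng.integers(0, 2, size=512)    # stand-in for jamming information
step = 0.01                               # assumed quantization step

approx, detail = haar_dwt(segment)
protected = haar_idwt(approx, qim_embed(detail, payload, step))

_, detail_check = haar_dwt(protected)
recovered = qim_extract(detail_check, step)
```

In this sketch the step size controls the trade-off that the approach depends on: a larger step makes the embedded information survive more distortion, while a smaller step keeps the change to the waveform smaller. A masking-based method would choose where and how strongly to embed so that the change stays below the audibility threshold.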
Datasets
Zhvoice and VCTK Corpus datasets
Model(s)
ElevenLabs, GPT-SoVITS, XTTSv2, SEED-VC, and StyleTTS2 (for evaluation of defense effectiveness)
Author countries
China