ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks

Authors: Yuanda Wang, Bocheng Chen, Hanqing Guo, Guangjing Wang, Weikang Ding, Qiben Yan

Published: 2025-08-25 04:46:35+00:00

Comment: 14 Pages, Accepted by AsiaCCS 2025

AI Summary

This paper introduces ClearMask, a noise-free defense mechanism against voice deepfake attacks that preserves audio naturalness. It modifies the audio mel-spectrogram by selective frequency filtering, applies audio style transfer, and optimizes reverberation to induce transferable voice feature loss. The paper also proposes LiveMask for protecting streaming speech in real time. Both prevent deepfake voices from deceiving speaker verification models and human listeners, even for unseen voice synthesis models, and ClearMask additionally shows resilience against adaptive attackers.

Abstract

Voice deepfake attacks, which artificially impersonate human speech for malicious purposes, have emerged as a severe threat. Existing defenses typically inject noise into human speech to compromise voice encoders in speech synthesis models. However, these methods degrade audio quality and require prior knowledge of the attack approaches, limiting their effectiveness in diverse scenarios. Moreover, real-time audio, such as speech in virtual meetings and voice messages, is still exposed to voice deepfake threats. To overcome these limitations, we propose ClearMask, a noise-free defense mechanism against voice deepfake attacks. Unlike traditional approaches, ClearMask modifies the audio mel-spectrogram by selectively filtering certain frequencies, inducing a transferable voice feature loss without injecting noise. We then apply audio style transfer to further deceive voice decoders while preserving perceived sound quality. Finally, optimized reverberation is introduced to disrupt the output of voice generation models without affecting the naturalness of the speech. Additionally, we develop LiveMask to protect streaming speech in real time through a universal frequency filter and reverberation generator. Our experimental results show that ClearMask and LiveMask effectively prevent voice deepfake attacks from deceiving speaker verification models and human listeners, even for unseen voice synthesis models and black-box API services. Furthermore, ClearMask demonstrates resilience against adaptive attackers who attempt to recover the original audio signal from the protected speech samples.


Key findings
ClearMask and LiveMask effectively prevent voice deepfake attacks: speech synthesized from protected audio is rejected at rates near 100% by both open-source (ECAPA-TDNN) and commercial (Soniox) speaker verification systems, and it also fails to deceive human listeners. The approach transfers well to unseen voice synthesis models and commercial APIs, and it significantly outperforms existing defenses in audio quality (higher MOS, STOI, and PEAQ scores) and in robustness against adaptive attackers.
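
To make the rejection-rate figure concrete, the following is a minimal sketch of how such a rate could be computed from speaker embeddings. It assumes cosine scoring against a fixed threshold, which is a common setup for embedding-based verifiers but is not confirmed by this summary as the paper's exact protocol. The function names `cosine_similarity` and `rejection_rate`, the threshold of 0.5, and the toy 2-D vectors are all hypothetical; real embeddings would come from a speaker encoder such as ECAPA-TDNN.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rejection_rate(enrolled, candidates, threshold):
    """Fraction of candidate embeddings scoring below `threshold`
    against the enrolled speaker embedding (i.e. rejected)."""
    scores = [cosine_similarity(enrolled, c) for c in candidates]
    rejected = sum(score < threshold for score in scores)
    return rejected / len(scores)

# Toy example with hypothetical 2-D "embeddings".
enrolled = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # close to the enrolled speaker, accepted
    [0.0, 1.0],   # orthogonal, rejected
    [-1.0, 0.0],  # opposite direction, rejected
]
rate = rejection_rate(enrolled, candidates, threshold=0.5)
print(f"rejection rate: {rate:.3f}")
```

In this framing, an effective defense is one where deepfakes synthesized from protected speech land below the verifier's threshold, so the rate approaches 1.0.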
Approach
ClearMask defends against voice deepfake attacks through a three-stage noise-free process: spectrogram masking (selectively filtering frequencies), audio style transfer (deceiving voice decoders while preserving quality), and optimized reverberation generation (disrupting voice generation). For real-time scenarios, LiveMask uses a universal frequency filter and reverberation generator. An ensemble encoder approach enhances transferability across diverse deepfake models.
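
Two of the stages named above, frequency filtering and reverberation, can be illustrated with a simplified sketch. This is not the authors' implementation: ClearMask optimizes which frequencies to filter and how to shape the reverberation, whereas this example uses a hand-picked set of bins and a fixed toy impulse response. The style-transfer stage is omitted. All names (`mask_frequency_bins`, `add_reverb`) and values are assumptions chosen for illustration.

```python
import numpy as np

def mask_frequency_bins(spec, bins_to_drop, floor=0.0):
    """Return a copy of a spectrogram (n_bins x n_frames) in which the
    selected frequency rows are set to `floor`. The input is not modified."""
    out = np.array(spec, dtype=float)  # np.array copies by default
    out[list(bins_to_drop), :] = floor
    return out

def add_reverb(signal, impulse_response):
    """Convolve a dry signal with an impulse response, trim the result to
    the original length, and rescale it to the original peak amplitude."""
    signal = np.asarray(signal, dtype=float)
    wet = np.convolve(signal, impulse_response)[: len(signal)]
    wet_peak = np.max(np.abs(wet))
    if wet_peak > 0:
        wet = wet * (np.max(np.abs(signal)) / wet_peak)
    return wet

# Toy spectrogram: 8 frequency bins, 5 frames, all ones.
spec = np.ones((8, 5))
masked = mask_frequency_bins(spec, bins_to_drop=[2, 5])

# Toy waveform: 100 ms of a 220 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(int(0.1 * sample_rate)) / sample_rate
dry = np.sin(2 * np.pi * 220.0 * t)

# Hypothetical impulse response: direct path plus two decaying echoes.
ir = np.zeros(400)
ir[0], ir[150], ir[399] = 1.0, 0.4, 0.15
wet = add_reverb(dry, ir)
```

Neither operation adds a noise term to the signal: one removes energy in chosen bands and the other adds delayed copies of the speech itself, which is the sense in which such a defense can be called noise-free.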
Datasets
VCTK-Corpus, LibriSpeech, Mozilla Common Voice
Model(s)
Surrogate models: AdaIN-VC (VAE), AutoVC (VAE), SV2TTS (LSTM). Test models: YourTTS (ResNet), DiffVC (VAE), AGAIN-VC (U-Net). Commercial APIs: ElevenLabs, Play.ht. Speaker Verification Models: ECAPA-TDNN (SpeechBrain), Soniox speaker identification service API.
Author countries
USA