ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks

Authors: Yuanda Wang, Bocheng Chen, Hanqing Guo, Guangjing Wang, Weikang Ding, Qiben Yan

Published: 2025-08-25 04:46:35+00:00

AI Summary

ClearMask is a noise-free defense against voice deepfake attacks that modifies audio mel-spectrograms by selectively filtering frequencies, uses audio style transfer to deceive decoders, and introduces optimized reverberation to disrupt voice generation models. LiveMask, a real-time version, uses a universal frequency filter and reverberation generator for immediate protection.

Abstract

Voice deepfake attacks, which artificially impersonate human speech for malicious purposes, have emerged as a severe threat. Existing defenses typically inject noise into human speech to compromise voice encoders in speech synthesis models. However, these methods degrade audio quality and require prior knowledge of the attack approaches, limiting their effectiveness in diverse scenarios. Moreover, real-time audio, such as speech in virtual meetings and voice messages, is still exposed to voice deepfake threats. To overcome these limitations, we propose ClearMask, a noise-free defense mechanism against voice deepfake attacks. Unlike traditional approaches, ClearMask modifies the audio mel-spectrogram by selectively filtering certain frequencies, inducing a transferable voice feature loss without injecting noise. We then apply audio style transfer to further deceive voice decoders while preserving perceived sound quality. Finally, optimized reverberation is introduced to disrupt the output of voice generation models without affecting the naturalness of the speech. Additionally, we develop LiveMask to protect streaming speech in real time through a universal frequency filter and reverberation generator. Our experimental results show that ClearMask and LiveMask effectively prevent voice deepfake attacks from deceiving speaker verification models and human listeners, even for unseen voice synthesis models and black-box API services. Furthermore, ClearMask demonstrates resilience against adaptive attackers who attempt to recover the original audio signal from the protected speech samples.

Key findings

ClearMask and LiveMask prevented voice deepfake attacks from deceiving both speaker verification models and human listeners, including attacks that used unseen voice synthesis models and black-box API services. ClearMask also showed resilience against adaptive attackers who attempted to recover the original audio signal from protected samples. Protected audio retained higher perceived quality than audio protected by existing noise-injection defenses.

Approach

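ClearMask employs a three-stage approach: spectrogram masking, which selectively filters frequencies in the mel-spectrogram to induce a transferable loss of voice features without injecting noise; audio style transfer, which further deceives voice decoders while preserving perceived sound quality; and optimized reverberation, which disrupts the output of voice generation models without affecting the naturalness of the speech. LiveMask adapts this to streaming speech by using a universal frequency filter and a reverberation generator.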
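
To make the filtering and reverberation stages more concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the authors' implementation: ClearMask optimizes its mel-spectrogram mask and reverberation against surrogate models, whereas this sketch applies a fixed, hand-picked band-stop filter to the whole-signal spectrum and convolves with a synthetic impulse response. The function names (`band_stop`, `synthetic_impulse_response`, `add_reverb`), the chosen band, and all parameter values are assumptions made for illustration, and the style-transfer stage is omitted.

```python
import numpy as np

def band_stop(signal, sample_rate, low_hz, high_hz, gain=0.0):
    """Scale spectrum content between low_hz and high_hz by `gain` (0.0 removes it)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[in_band] *= gain
    return np.fft.irfft(spectrum, n=len(signal))

def synthetic_impulse_response(sample_rate, decay_s=0.05, length_s=0.2, seed=0):
    """Exponentially decaying random tail: a toy stand-in for a room response (assumption)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * sample_rate)) / sample_rate
    tail = rng.standard_normal(t.size) * np.exp(-t / decay_s)
    tail[0] = 1.0  # direct (un-delayed) path
    return tail

def add_reverb(signal, impulse_response, wet=0.2):
    """Mix the dry signal with its convolution against the impulse response."""
    wet_signal = np.convolve(signal, impulse_response)[: len(signal)]
    # Match the wet signal's peak level to the dry signal before mixing.
    wet_signal *= np.max(np.abs(signal)) / (np.max(np.abs(wet_signal)) + 1e-12)
    return (1.0 - wet) * signal + wet * wet_signal

# Toy input: two tones standing in for speech (hypothetical test signal).
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
speech_like = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

filtered = band_stop(speech_like, sample_rate, low_hz=2500, high_hz=3500)
protected = add_reverb(filtered, synthetic_impulse_response(sample_rate))
```

Both operations are linear transformations of the input signal, so no independent noise is added to the speech; this is the property the summary describes as "noise-free". Which frequencies to filter and how to shape the reverberation are the parts the paper optimizes, and this sketch does not attempt that.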

Datasets

VCTK-Corpus, LibriSpeech, Mozilla Common Voice

Model(s)

AdaIN-VC, AutoVC, SV2TTS (surrogate models); YourTTS, DiffVC, AGAIN-VC, ElevenLabs, Play.ht (test models); ECAPA-TDNN (speaker verification)

Author countries

USA
