SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Authors: Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu

Published: 2024-09-14 02:45:09+00:00

AI Summary

SafeEar is a novel framework for audio deepfake detection that preserves content privacy: a neural audio codec decouples semantic content from acoustic information (prosody and timbre), and only the acoustic information is used for detection. This approach achieves a low equal error rate (EER) while preventing content recovery by both machines and human listeners.

Abstract

Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfake, poses a significant threat to both society and individuals. Existing countermeasures largely determine the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that detects deepfake audio without accessing the speech content within. Our key idea is to devise a neural audio codec into a novel decoupling model that separates the semantic and acoustic information in audio samples and uses only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) as low as 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark for anti-deepfake and anti-content-recovery evaluation provides a basis for future research in audio privacy preservation and deepfake detection.


Key findings
SafeEar achieves an EER as low as 2.02% on various deepfake datasets, comparable to state-of-the-art methods that don't protect privacy. It effectively prevents content recovery with word error rates above 93.93%, resisting attacks from naive, knowledgeable, and adaptive adversaries. The effectiveness is also validated by a user study.
Approach
SafeEar decouples audio into semantic and acoustic information using a neural audio codec. Only the acoustic information is used for deepfake detection, preventing the exposure of private speech content. Real-world codec augmentation enhances the detector's robustness.
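The decoupling idea can be sketched with a toy residual vector quantization (RVQ) pipeline: the first quantizer absorbs (a proxy for) semantic information, and only the residual, quantized by later codebooks, is handed to the detector. This is a minimal illustration, not SafeEar's implementation; the codebooks here are random placeholders, whereas in the paper the first quantizer is distilled from HuBERT.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(frames, codebook):
    """Return the nearest codebook entry for each frame (L2 distance)."""
    # distances: (n_frames, n_codes)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

# Toy stand-ins: 100 encoder frames of dimension 16, plus two codebooks.
# Both codebooks are random placeholders for illustration only.
frames = rng.normal(size=(100, 16))
semantic_codebook = rng.normal(size=(64, 16))   # quantizer 1 (semantic proxy)
acoustic_codebook = rng.normal(size=(64, 16))   # later quantizers (acoustic)

# Quantizer 1 captures the semantic component of each frame.
semantic_tokens = nearest_code(frames, semantic_codebook)

# The residual left after removing the semantic tokens carries the
# acoustic information (prosody, timbre); only this reaches the detector.
residual = frames - semantic_tokens
acoustic_tokens = nearest_code(residual, acoustic_codebook)
```

The privacy property follows from the data flow: `semantic_tokens` never leave the codec, so the detector operates purely on the acoustic residual.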
Datasets
ASVspoof 2019, ASVspoof 2021, Librispeech, CVoiceFake (a multilingual dataset created by the authors)
Model(s)
Transformer-based deepfake detector using multi-head self-attention with 8 heads; a codec-based decoupling model combining a convolutional encoder-decoder with HuBERT-guided residual vector quantizers (RVQs).
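The detector's core operation, multi-head self-attention with 8 heads, can be sketched in numpy. This is a generic illustration of the mechanism, not SafeEar's trained model: the projection matrices are random stand-ins for learned weights, and the input dimension of 128 is an assumed example value.

```python
import numpy as np

rng = np.random.default_rng(1)

def multi_head_self_attention(x, n_heads=8):
    """Scaled dot-product self-attention split across n_heads heads.

    x: (seq_len, d_model), with d_model divisible by n_heads.
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Random projections stand in for learned Q/K/V weight matrices.
    w_q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    w_k = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    w_v = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

    # Project, then split the feature dimension across heads.
    q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # row-wise softmax
    out = weights @ v                                     # (heads, seq, d_head)
    # Re-merge heads back into the model dimension.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

tokens = rng.normal(size=(50, 128))   # 50 acoustic-token frames (assumed dims)
attended = multi_head_self_attention(tokens, n_heads=8)
```

Each head attends over the whole sequence of acoustic tokens with its own low-dimensional projection, which is why the detector can pick up prosody and timbre patterns without any semantic content in its input.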
Author countries
China