SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Authors: Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu

Published: 2024-09-14 02:45:09+00:00

Comment: Accepted by ACM CCS 2024. Please cite this paper as "Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu. SafeEar: Content Privacy-Preserving Audio Deepfake Detection. In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2024."

AI Summary

This paper introduces SafeEar, a novel framework for content privacy-preserving audio deepfake detection. It achieves this by decoupling speech into semantic and acoustic information using a neural audio codec, then employing only the acoustic information for deepfake detection. SafeEar demonstrates high effectiveness in detecting various deepfake techniques while simultaneously shielding speech content from machine and human recovery attempts.

Abstract

Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, the audio deepfake, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without relying on access to the speech content within. Our key idea is to turn a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. Simultaneously, it shields five-language speech content from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content recovery evaluation provides a basis for future research in audio privacy preservation and deepfake detection.


Key findings

SafeEar achieved an Equal Error Rate (EER) as low as 2.02% on four benchmark datasets, demonstrating comparable performance to state-of-the-art detectors. Simultaneously, it successfully protected five-language speech content from machine and human auditory analysis, with Word Error Rates (WERs) for content recovery all above 93.93%. The framework also demonstrated robustness against various deepfake generation techniques and transmission codecs.
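
As a reading aid for the two metrics quoted above, here is a minimal Python sketch; it is not the authors' evaluation code. It assumes that a higher detector score means "more likely bona fide", approximates the EER by sweeping thresholds over the observed scores, and computes WER as word-level edit distance divided by the reference length. The function names `equal_error_rate` and `word_error_rate` are hypothetical.

```python
from typing import Sequence


def equal_error_rate(bonafide_scores: Sequence[float],
                     spoof_scores: Sequence[float]) -> float:
    """Approximate EER: the error rate at the threshold where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest.
    Assumption: higher score = more likely bona fide."""
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # FRR: bona fide samples rejected because they score below t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # FAR: spoofed samples accepted because they score at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Read this way, a low EER means the detector rarely confuses bona fide and spoofed audio, while a WER close to 100% means a transcript recovered from the protected representation shares almost no words with what was actually said. WER can exceed 100% when the hypothesis contains many inserted words.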

Approach

SafeEar uses a novel codec-based decoupling model (CDM) to separate speech into semantic and acoustic tokens. It then applies a bottleneck and shuffle layer to the acoustic tokens to further protect content privacy by obfuscating temporal patterns. A Transformer-based detector, enhanced with real-world codec augmentation during training, then analyzes these shuffled acoustic tokens to detect deepfakes without accessing semantic content.
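
The shuffling step described above can be illustrated with a small Python sketch. This is only one plausible reading and not the paper's implementation: it assumes the acoustic tokens arrive as a sequence of frames and that frames are permuted independently inside fixed-size windows. The function name `shuffle_frames`, the `window` size, and the `seed` parameter are all hypothetical.

```python
import random
from typing import List, Sequence


def shuffle_frames(frames: Sequence, window: int = 50, seed: int = 0) -> List:
    """Permute frames inside each consecutive window of `window` frames.

    Local order is scrambled, which obscures word-level temporal patterns,
    while every frame stays within its original window. The window size
    and the seeded RNG are illustrative assumptions.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    rng = random.Random(seed)
    out: List = []
    for start in range(0, len(frames), window):
        chunk = list(frames[start:start + window])
        rng.shuffle(chunk)  # in-place permutation of this window only
        out.extend(chunk)
    return out
```

For example, `shuffle_frames(list(range(10)), window=5)` returns a list whose first five entries are some permutation of 0 to 4 and whose last five are some permutation of 5 to 9. The set of frames per window is unchanged, so the sketch keeps window-level acoustic statistics available to a detector while removing the original frame order.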

Datasets

ASVspoof 2019, ASVspoof 2021, CVoiceFake (multilingual), LibriSpeech

Model(s)

Neural Audio Codec (encoder-decoder with HuBERT-equipped RVQs and discriminator), Transformer-based detector (with Multi-Head Self-Attention)

Author countries

China