Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake

Authors: Liping Chen, Kong Aik Lee, Zhen-Hua Ling, Xin Wang, Rohan Kumar Das, Tomoki Toda, Haizhou Li

Published: 2025-09-08 06:22:36+00:00

AI Summary

This paper provides a concise overview of three techniques for addressing security threats from deepfake speech: voice anonymization, deepfake detection, and watermarking. It describes their methodologies, advancements, and challenges, highlighting the need for further research into integrating these techniques.

Abstract

In the era of big data, remarkable advancements have been achieved in personalized speech generation techniques that utilize speaker attributes, including voice and speaking style, to generate deepfake speech. This has also amplified global security risks from deepfake speech misuse, resulting in considerable societal costs worldwide. To address the security threats posed by deepfake speech, techniques have been developed focusing on both the protection of voice attributes and the defense against deepfake speech. Among them, the voice anonymization technique has been developed to protect voice attributes from extraction for deepfake generation, while deepfake detection and watermarking have been utilized to defend against the misuse of deepfake speech. This paper provides a short and concise overview of the three techniques, describing the methodologies, advancements, and challenges. A comprehensive version, offering additional discussions, will be published in the near future.


Key findings
The paper identifies challenges in each technique: anonymization methods are vulnerable to attackers who have access to the anonymization system, deepfake detectors are prone to shortcut learning, and current watermarking algorithms have limited robustness against sophisticated attacks. It emphasizes the need for further research to address these challenges and to integrate the three techniques.
Approach
The paper surveys existing methods for voice anonymization, deepfake detection, and watermarking applied to audio. It categorizes approaches (generative vs. adversarial for anonymization, feature extraction and classification methods for detection, and post-processing vs. collaborative methods for watermarking) and discusses their strengths and weaknesses.
Datasets
ASVspoof challenge datasets, audio deepfake detection challenge datasets, the VoicePrivacy Challenge 2024 dataset, and various other datasets cited in the references.
Model(s)
The paper mentions various deep neural networks (DNNs) in relation to deepfake detection and watermarking, including CNNs, ResNets, graph-based networks, state-space models, and encoder-decoder networks. Self-supervised learning (SSL)-based models are also used for feature extraction.
Author countries
China, Hong Kong, Japan, Singapore