A Comparative Study on Proactive and Passive Detection of Deepfake Speech

Authors: Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

Published: 2025-06-17 10:54:08+00:00

AI Summary

This research proposes a framework for comparing proactive (watermarking) and passive (conventional detection) deepfake speech detection models. It ensures fair comparison by training and testing all models on common datasets with a shared metric, and analyzes their robustness against adversarial attacks.

Abstract

Solutions for defending against deepfake speech fall into two categories: proactive watermarking models and passive conventional deepfake detectors. While both address common threats, their differences in training, optimization, and evaluation prevent a unified protocol for jointly evaluating them and for selecting the best solution for a given use case. This work proposes a framework to evaluate both model types in deepfake speech detection. To ensure fair comparison and minimize discrepancies, all models were trained and tested on common datasets, with performance evaluated using a shared metric. We also analyze their robustness against various adversarial attacks, showing that different models exhibit distinct vulnerabilities to different speech attribute distortions. Our training and evaluation code is available on GitHub.


Key findings
Watermarking models and SSL-AASIST achieved near-perfect results on clean datasets. However, all models showed performance degradation under various transmission and manipulation conditions. Timbre demonstrated the highest robustness overall, but still suffered significant performance drops under certain attacks (e.g., pitch shifting and WavTokenizer re-synthesis).
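To make the robustness tests concrete, the following is a minimal sketch of how a pitch-shift perturbation of the kind referenced above could be applied to a waveform before it is passed to a detector or watermark decoder. The use of librosa, the 16 kHz sampling rate, and the two-semitone shift are illustrative assumptions and do not reproduce the authors' exact attack settings.

```python
# Hypothetical illustration of a pitch-shift attack on a speech waveform.
# librosa and the 2-semitone shift are assumptions, not the paper's exact setup.
import librosa

def pitch_shift_attack(wav_path, n_steps=2.0):
    """Load a waveform and shift its pitch by n_steps semitones."""
    y, sr = librosa.load(wav_path, sr=16000)  # mono speech at 16 kHz
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    return y_shifted, sr

# The perturbed audio would then be scored by each detector or watermark
# decoder to measure how much its error rate degrades versus the clean case.
```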
Approach
The authors created a unified evaluation framework to compare proactive watermarking and passive deepfake detection models. This involved training all models on shared datasets, employing a common evaluation metric, the equal error rate (EER), and testing robustness under various adversarial attacks.
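As a concrete illustration of the shared metric, below is a minimal sketch of how the EER can be computed from per-utterance detection scores and bona fide/spoof labels. The function name and the use of scikit-learn's roc_curve are assumptions; this is not the authors' released evaluation code.

```python
# Minimal EER computation from detection scores; assumes higher score = positive class.
# Uses scikit-learn's roc_curve; not taken from the authors' released code.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal error rate: operating point where false-accept and false-reject rates meet.

    labels: 1 for bona fide (or watermarked), 0 for spoofed (or unmarked).
    scores: higher values indicate the positive class.
    """
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return 0.5 * (fpr[idx] + fnr[idx])

# Example: perfectly separated scores yield an EER of 0.
# compute_eer([1, 1, 0, 0], [0.9, 0.7, 0.4, 0.2])  -> 0.0
```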
Datasets
ASVspoof 2019 LA dataset (training, development, and test sets), ASVspoof 2021 LA dataset
Model(s)
AASIST, SSL-AASIST, Timbre, AudioSeal
Author countries
Japan, Taiwan