VoiceWukong: Benchmarking Deepfake Voice Detection

Authors: Ziwei Yan, Yanjie Zhao, Haoyu Wang

Published: 2024-09-10 09:07:12+00:00

AI Summary

VoiceWukong is a new benchmark dataset for deepfake voice detection, addressing limitations in existing datasets by covering two languages (English and Chinese) and 38 data variants spanning six manipulation types. An evaluation of 12 state-of-the-art detectors revealed significant challenges for real-world application, with most exceeding a 20% equal error rate (EER).

Abstract

With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio: different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.


Key findings
AASIST2 achieved the best equal error rate (EER) of 13.50% on the English dataset and 13.54% on the Chinese dataset, a significant drop from its originally reported performance. Most detectors showed EERs above 20%, highlighting the difficulty of real-world deepfake voice detection. Human participants in a user study outperformed most detectors on easily detectable deepfakes but underperformed on more sophisticated ones.
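The EER used throughout these results is the operating point where a detector's false acceptance rate (spoofed audio accepted as genuine) equals its false rejection rate (genuine audio rejected). A minimal sketch of how it can be estimated from per-sample scores (not the paper's evaluation code; a standard threshold-sweep approximation, assuming higher scores indicate bona fide audio):

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping candidate thresholds and finding
    the point where the false acceptance rate (FAR, spoof scored as
    genuine) and false rejection rate (FRR, genuine scored as spoof)
    are closest. Higher score = more likely bona fide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is 0; a detector whose score distributions fully overlap approaches 0.5 (50%), which is why EERs above 20% indicate detectors are far from reliable in practice.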
Approach
The authors created VoiceWukong, a benchmark dataset of deepfake voices generated by 19 commercial and 15 open-source tools, with variants covering six types of manipulations. They evaluated 12 state-of-the-art deepfake voice detectors on this dataset and conducted a user study to assess how well humans identify deepfakes.
Datasets
VCTK (English), MAGICDATA (Chinese), ASVspoof2019-LA, ESC-50
Model(s)
AASIST, RawNet2, RawBoost, OC-Softmax, RawGAT-ST, SAMO, Res-TSSDNet, RawNet2-Vocoder, AASIST2, Raw PC-DARTS, RawBMamba, CLAD, Qwen2-Audio
Author countries
China