Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems

Authors: Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam

Published: 2025-09-11 07:20:18+00:00

AI Summary

This paper introduces 'bona fide cross-testing', a novel evaluation framework for audio deepfake detection that pairs detectors with diverse bona fide speech datasets to produce more robust and interpretable evaluations than traditional protocols. It addresses limitations of existing evaluations by covering diverse bona fide speech types and aggregating per-pair Equal Error Rates (EERs) for a more balanced assessment.

Abstract

Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at https://github.com/cyaaronk/audio_deepfake_eval.
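To make the sample-weighting issue concrete, the sketch below (not from the paper) simulates detector scores for one well-detected synthesizer with many samples and one poorly detected synthesizer with few samples; the Gaussian score distributions, sample counts, and the ROC-based compute_eer helper are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Approximate the Equal Error Rate (EER): the operating point where the
    false acceptance and false rejection rates are equal. labels: 1 = bona fide."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)
bona_fide = rng.normal(2.0, 1.0, 1_000)    # scores for real speech (higher = more "real")
easy_spoof = rng.normal(-2.0, 1.0, 9_000)  # well-separated synthesizer, many samples
hard_spoof = rng.normal(1.5, 1.0, 100)     # overlapping synthesizer, few samples

def eer_against(spoof_scores):
    scores = np.concatenate([bona_fide, spoof_scores])
    labels = np.concatenate([np.ones(len(bona_fide), dtype=int),
                             np.zeros(len(spoof_scores), dtype=int)])
    return compute_eer(labels, scores)

print(f"easy synthesizer EER: {eer_against(easy_spoof):.3f}")   # low
print(f"hard synthesizer EER: {eer_against(hard_spoof):.3f}")   # high
# A single pooled EER is dominated by the 9,000 easy samples,
# masking how poorly the hard synthesizer is detected.
print(f"pooled EER:           {eer_against(np.concatenate([easy_spoof, hard_spoof])):.3f}")
```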


Key findings
The bona fide cross-testing framework revealed vulnerabilities in existing audio deepfake detection models, particularly in their handling of bona fide audio from diverse environments and speech styles. Among the models tested, Wav2Vec-SCL demonstrated the best robustness against spoof attacks.
Approach
Bona fide cross-testing pairs diverse bona fide speech datasets with individual synthesizers and computes a separate EER for each combination. The per-pair results are then aggregated by maximum pooling, so the reported score reflects the highest EERs and highlights the synthesizers that are hardest to detect.
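A minimal sketch of how such a cross-testing loop could be implemented, assuming a score_utterances function that runs a detector over a dataset and the compute_eer helper from the sketch above; the dataset dictionaries and function names are hypothetical, not the authors' released code (see the repository linked in the abstract for the official tooling).

```python
from itertools import product
import numpy as np

def cross_test_eers(bona_fide_sets, spoof_sets, score_utterances, compute_eer):
    """Compute one EER per (bona fide dataset, synthesizer) pair.

    bona_fide_sets / spoof_sets: dicts mapping a dataset name to its audio files.
    score_utterances: callable returning one detection score per file.
    compute_eer: callable mapping (labels, scores) to an Equal Error Rate.
    """
    eers = {}
    for bf_name, syn_name in product(bona_fide_sets, spoof_sets):
        bf_scores = np.asarray(score_utterances(bona_fide_sets[bf_name]))
        sp_scores = np.asarray(score_utterances(spoof_sets[syn_name]))
        labels = np.concatenate([np.ones(len(bf_scores), dtype=int),
                                 np.zeros(len(sp_scores), dtype=int)])
        scores = np.concatenate([bf_scores, sp_scores])
        eers[(bf_name, syn_name)] = compute_eer(labels, scores)
    return eers

def max_pooled_eer(eers):
    """One reading of the max-pooling aggregation: report the worst-case EER,
    i.e. the (bona fide, synthesizer) pairing that is hardest for the detector."""
    return max(eers.values())
```

Under this reading, ranking the per-pair EERs also shows which synthesizers and which bona fide conditions a given model struggles with most.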
Datasets
Bona fide: AMI IHM, AMI SDM, LibriSpeech test-clean, LibriSpeech test-other, VCTK, FakeAVCeleb-v1.2, In-The-Wild, EmoFake-EN, AV-Deepfake1M. Spoof: ASVspoof2019 LA, ASVspoof2021 DF, FakeAVCeleb-v1.2, EmoFake-EN, AV-Deepfake1M, CodecFake, MLAAD-v3-EN, LlamaPartialSpoof.
Model(s)
Wav2Vec-Conformer, Wav2Vec-TCM, Wav2Vec-SCL
Author countries
Singapore