Robust Localization of Partially Fake Speech: Metrics, Models, and Out-of-Domain Evaluation

Authors: Hieu-Thi Luong, Inbal Rimon, Haim Permuter, Kong Aik Lee, Eng Siong Chng

Published: 2025-07-04 10:46:11+00:00

AI Summary

This paper analyzes limitations in evaluating partial audio deepfake localization, advocating for threshold-dependent metrics such as accuracy and F1-score over Equal Error Rate (EER). It demonstrates that existing models, while strong in-domain, generalize poorly to out-of-domain data, and that increasing the amount of training data does not always improve performance.

Abstract

Partial audio deepfake localization poses unique challenges and remains underexplored compared to full-utterance spoofing detection. While recent methods report strong in-domain performance, their real-world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of Equal Error Rate (EER), which often obscures generalization and deployment readiness. We propose reframing the localization task as a sequential anomaly detection problem and advocate for the use of threshold-dependent metrics such as accuracy, precision, recall, and F1-score, which better reflect real-world behavior. Specifically, we analyze the performance of the open-source Coarse-to-Fine Proposal Refinement Framework (CFPRF), which achieves a 20-ms EER of 7.61% on the in-domain PartialSpoof evaluation set, but 43.25% and 27.59% on the LlamaPartialSpoof and Half-Truth out-of-domain test sets. Interestingly, our reproduced version of the same model performs worse on in-domain data (9.84%) but better on the out-of-domain sets (41.72% and 14.98%, respectively). This highlights the risks of over-optimizing for in-domain EER, which can lead to models that perform poorly in real-world scenarios. It also suggests that while deep learning models can be effective on in-domain data, they generalize poorly to out-of-domain scenarios, failing to detect novel synthetic samples and misclassifying unfamiliar bona fide audio. Finally, we observe that adding more bona fide or fully synthetic utterances to the training data often degrades performance, whereas adding partially fake utterances improves it.
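For context on the metric the abstract critiques, below is a minimal sketch of how a pooled 20-ms segment-level EER is commonly computed from per-segment fake/bona fide scores. It assumes NumPy and scikit-learn, and the toy scores and labels are synthetic stand-ins rather than outputs of the paper's models.

```python
import numpy as np
from sklearn.metrics import roc_curve


def segment_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate over pooled 20-ms segment scores.

    scores: higher means more likely fake; labels: 1 = fake segment, 0 = bona fide.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR ~= FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)


# toy example: pooled segment scores from all evaluation utterances
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)
scores = labels + rng.normal(0.0, 0.8, size=10_000)  # imperfect detector
print(f"20-ms segment EER: {segment_eer(scores, labels):.2%}")
```

Because the EER sweep selects the threshold after the fact, it can report a flattering error rate even when no single fixed threshold would work well in deployment, which is the gap the threshold-dependent metrics are meant to expose.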


Key findings
Existing models exhibit poor generalization to out-of-domain data, even with increased training data. Threshold-dependent metrics provide a more realistic evaluation of model performance in real-world scenarios. Adding partially fake utterances to the training data improves model performance, while adding bona fide or fully synthetic utterances often degrades it.
Approach
The authors reframe partial audio deepfake localization as a sequential anomaly detection problem. They evaluate existing models (MRM and CFPRF) using threshold-dependent metrics, analyzing how well the models generalize from in-domain to out-of-domain datasets. They also investigate the impact of data augmentation and of adding different types of data (bona fide, fully synthetic, and partially fake utterances) to the training set; a sketch of the threshold-dependent evaluation follows.
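The sketch below illustrates the kind of threshold-dependent evaluation the authors advocate: a single operating threshold is fixed in advance and accuracy, precision, recall, and F1 are computed on 20-ms segment decisions. It assumes NumPy and scikit-learn; the in-domain and out-of-domain score distributions are synthetic stand-ins, not the paper's models or datasets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def thresholded_metrics(scores, labels, threshold):
    """Accuracy, precision, recall, and F1 for 20-ms segment decisions at a fixed threshold."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    prec, rec, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": prec,
        "recall": rec,
        "f1": f1,
    }


# toy in-domain and out-of-domain score distributions; real scores would come
# from a localization model's per-segment outputs
rng = np.random.default_rng(1)


def toy_set(shift):
    labels = rng.integers(0, 2, size=5_000)
    scores = labels + rng.normal(shift, 0.8, size=5_000)
    return scores, labels


in_domain = toy_set(0.0)
out_of_domain = toy_set(0.5)  # score distribution shifts under domain mismatch

threshold = 0.5  # fixed operating point chosen in-domain
for name, (scores, labels) in [("in-domain", in_domain), ("out-of-domain", out_of_domain)]:
    print(name, thresholded_metrics(scores, labels, threshold))
```

With the threshold frozen, a shift in the score distribution visibly degrades precision and recall, whereas an EER computed separately on each set would re-pick its own threshold and mask the mismatch.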
Datasets
PartialSpoof, LlamaPartialSpoof, Half-Truth
Model(s)
Multi-Resolution Model (MRM), Coarse-to-Fine Proposal Refinement Framework (CFPRF)
Author countries
Singapore, Israel, China