WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection

Authors: Juho Jung, Sangyoun Lee, Jooeon Kang, Yunjin Na

Published: 2024-08-06 04:44:10+00:00

Comment: 4 pages, 2 figures, 2 tables, Accepted as Oral Presentation at The Trustworthy AI Workshop @ IJCAI 2024

AI Summary

This paper introduces FakeMix, a novel clip-level evaluation benchmark for multimodal deepfake detection, to address the inflated accuracies of existing video-level benchmarks. It also proposes new metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), for assessing model robustness at a granular level within both video and audio segments. Evaluation reveals that state-of-the-art models, despite high video-level performance, show significant accuracy drops when tested on FakeMix using these clip-level metrics.

Abstract

All current benchmarks for multimodal deepfake detection manipulate entire frames using various generation techniques, resulting in oversaturated detection accuracies exceeding 94% at video-level classification. However, these benchmarks struggle to detect dynamic deepfake attacks with challenging frame-by-frame alterations presented in real-world scenarios. To address this limitation, we introduce FakeMix, a novel clip-level evaluation benchmark aimed at identifying manipulated segments within both video and audio, providing insight into the origins of deepfakes. Furthermore, we propose novel evaluation metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), to assess the robustness of deepfake detection models. Evaluating state-of-the-art models against diverse deepfake benchmarks, particularly FakeMix, comprehensively demonstrates the effectiveness of our approach. Specifically, while existing models achieve an Average Precision (AP) of 94.2% at the video level, evaluating them at the clip level with the proposed metrics, TA and FDM, yields sharp declines in accuracy, to 53.1% and 52.1%, respectively.
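The abstract describes FakeMix construction only at a high level. As a concrete illustration, the sketch below assembles a FakeMix-style test sample by randomly swapping one-second segments of a genuine video/audio pair with their deepfaked counterparts and recording per-clip ground truth. The one-second granularity comes from the paper; the function name, array layout, and fake-segment probability are illustrative assumptions.

    import numpy as np

    def make_fakemix_sample(real_clips, fake_clips, p_fake=0.3, seed=None):
        # real_clips / fake_clips: aligned sequences of one-second clips
        # (e.g., frame or waveform arrays) from the genuine and deepfaked
        # versions of the same video. Both names are hypothetical.
        rng = np.random.default_rng(seed)
        labels = (rng.random(len(real_clips)) < p_fake).astype(int)
        mixed = [fake if y else real
                 for real, fake, y in zip(real_clips, fake_clips, labels)]
        return mixed, labels  # per-clip 0 (real) / 1 (fake) ground truth

The per-clip labels are what make clip-level evaluation possible: each one-second segment carries its own real/fake annotation rather than a single video-level verdict.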


Key findings
While state-of-the-art models achieve high video-level accuracies (e.g., 94.2% AP for AVAD on FakeAVCeleb), their performance declines sharply, to roughly 50-60% (53.1% TA and 52.1% FDM), when evaluated at the clip level on the FakeMix benchmark. This demonstrates that existing benchmarks overestimate detection capabilities and that current models struggle with dynamic, fine-grained deepfake manipulations.
Approach
The authors introduce FakeMix, a clip-level evaluation benchmark in which deepfake manipulations are applied at random to one-second segments of both the video and audio tracks. To evaluate models on this benchmark, they propose two novel metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), which measure detection accuracy at the level of individual clips and frames; a hedged sketch of such scoring follows below.
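The summary does not spell out the exact formulas for TA and FDM, so the following is a minimal sketch, assuming both reduce to segment-level accuracy: TA over one-second clips and FDM over individual frames. The function names, the 0/1 labeling scheme, and the example predictions are assumptions for illustration, not taken from the paper.

    import numpy as np

    def temporal_accuracy(clip_preds, clip_labels):
        # Assumed TA: the fraction of one-second clips whose predicted
        # real (0) / fake (1) label matches the ground truth.
        return float((np.asarray(clip_preds) == np.asarray(clip_labels)).mean())

    def frame_wise_discrimination(frame_preds, frame_labels):
        # Assumed FDM: the same accuracy computed per frame, so a model is
        # rewarded only for localizing manipulated frames, not for issuing
        # a single video-level verdict.
        return float((np.asarray(frame_preds) == np.asarray(frame_labels)).mean())

    # Example: a 10-clip video in which three consecutive clips are
    # deepfaked. A model that flags the whole video as fake is "correct"
    # at the video level yet scores poorly once the fake segments must
    # be localized.
    labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
    preds  = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]   # boundary mislocalized by one clip
    print(temporal_accuracy(preds, labels))    # 0.8

Under this reading, the drop from 94.2% AP to 53.1% TA and 52.1% FDM suggests that models can detect that a video contains a fake without being able to say where it occurs.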
Datasets
FakeMix, FakeAVCeleb, DFDC, KoDF
Model(s)
Xception, AVAD
Author countries
South Korea