WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection

Authors: Juho Jung, Sangyoun Lee, Jooeon Kang, Yunjin Na

Published: 2024-08-06 04:44:10+00:00

AI Summary

This paper introduces FakeMix, a novel clip-level benchmark for multimodal deepfake detection that focuses on identifying manipulated segments within video and audio, addressing the limitations of existing benchmarks that only evaluate full-video manipulations. It also proposes new evaluation metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), to assess model robustness.

Abstract

All current benchmarks for multimodal deepfake detection manipulate entire frames using various generation techniques, resulting in oversaturated detection accuracies exceeding 94% for video-level classification. However, these benchmarks struggle to detect dynamic deepfake attacks with challenging frame-by-frame alterations, as presented in real-world scenarios. To address this limitation, we introduce FakeMix, a novel clip-level evaluation benchmark aimed at identifying manipulated segments within both video and audio, providing insight into the origins of deepfakes. Furthermore, we propose novel evaluation metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), to assess the robustness of deepfake detection models. Evaluating state-of-the-art models against diverse deepfake benchmarks, particularly FakeMix, comprehensively demonstrates the effectiveness of our approach. Specifically, while existing models achieve an Average Precision (AP) of 94.2% at the video level, evaluating them at the clip level with the proposed metrics, TA and FDM, yields sharp declines in accuracy to 53.1% and 52.1%, respectively.


Key findings
Existing models achieve high accuracy on video-level deepfake detection benchmarks but perform markedly worse when evaluated at the clip level with FakeMix and the proposed TA and FDM metrics. This exposes the limitations of existing benchmarks and underscores the importance of clip-level evaluation for more robust and reliable deepfake detection.
Approach
The authors introduce FakeMix, a new benchmark dataset in which deepfakes are applied to random segments of audio and video clips. They also propose two new evaluation metrics, TA and FDM, which assess deepfake detection models at the clip and frame levels, respectively; a minimal sketch of both ideas follows.
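The summary names the metrics but does not give their formulas, so the Python sketch below only illustrates the general idea rather than the paper's definitions: TA is approximated as accuracy over fixed-length clips (a clip counts as fake if any of its frames is manipulated), FDM as plain per-frame accuracy, and the FakeMix-style labeling as one fake segment spliced in at a random position. The helper names, the clip length, and the any-frame rule are all assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_fake_segment(n_frames: int, seg_len: int) -> np.ndarray:
    """Per-frame labels (0 = real, 1 = fake) for a FakeMix-style clip in which
    a fake segment of `seg_len` frames starts at a random position.
    Hypothetical helper; the paper's exact construction may differ."""
    labels = np.zeros(n_frames, dtype=int)
    start = rng.integers(0, n_frames - seg_len + 1)
    labels[start : start + seg_len] = 1
    return labels

def clip_labels(frame_labels: np.ndarray, clip_len: int) -> np.ndarray:
    """Collapse per-frame labels into per-clip labels; a clip counts as fake
    if any of its frames is manipulated (an assumed rule)."""
    n_clips = len(frame_labels) // clip_len
    return frame_labels[: n_clips * clip_len].reshape(n_clips, clip_len).max(axis=1)

def temporal_accuracy(y_true, y_pred, clip_len: int = 30) -> float:
    """TA-like score: fraction of clips whose real/fake label is predicted correctly.
    `clip_len` (frames per clip) is a hypothetical parameter."""
    return float(np.mean(clip_labels(y_true, clip_len) == clip_labels(y_pred, clip_len)))

def frame_discrimination(y_true, y_pred) -> float:
    """FDM-like score: per-frame classification accuracy."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy example: a 300-frame video with a 60-frame deepfaked span. A detector
# that flags the whole video as fake is right at the video level (the video
# does contain a deepfake) but fails to localize the manipulation.
truth = mix_fake_segment(n_frames=300, seg_len=60)
pred = np.ones(300, dtype=int)
print(temporal_accuracy(truth, pred))     # low: most clips are actually real
print(frame_discrimination(truth, pred))  # 0.2: only the 60 fake frames match
```

In this toy setup the whole-video detector lands near 20% on both clip- and frame-level scores despite a perfect video-level label, which mirrors the gap between video-level AP and the TA/FDM results reported above.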
Datasets
FakeMix (proposed), FakeAVCeleb, DFDC, KoDF
Model(s)
Xception, AVAD
Author countries
South Korea