DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection

Authors: Marcel Klemt, Carlotta Segna, Anna Rohrbach

Published: 2025-06-06 08:10:54+00:00

AI Summary

This paper addresses reproducibility and dataset issues in audio-video deepfake detection. It introduces SIMBA, a simple multimodal baseline model, and proposes improved evaluation protocols for the DeepSpeak v1 and FakeAVCeleb datasets, focusing on mitigating the "silence shortcut" problem.

Abstract

Generative AI advances rapidly, allowing the creation of very realistic manipulated video and audio. This progress presents a significant security and ethical threat, as malicious users can exploit DeepFake techniques to spread misinformation. Recent DeepFake detection approaches explore the multimodal (audio-video) threat scenario. However, this area suffers from a lack of reproducibility and from critical issues with existing datasets, such as the recently uncovered silence shortcut in the widely used FakeAVCeleb dataset. Considering the importance of this topic, we aim to gain a deeper understanding of the key issues affecting benchmarking in audio-video DeepFake detection. We examine these challenges through the lens of the three core benchmarking pillars: datasets, detection methods, and evaluation protocols. To address these issues, we spotlight the recent DeepSpeak v1 dataset and are the first to propose an evaluation protocol and benchmark it using SOTA models. We introduce SImple Multimodal BAseline (SIMBA), a competitive yet minimalistic approach that enables the exploration of diverse design choices. We also deepen insights into the issue of audio shortcuts and present a promising mitigation strategy. Finally, we analyze and enhance the evaluation scheme on the widely used FakeAVCeleb dataset. Our findings offer a way forward in the complex area of audio-video DeepFake detection.


Key findings
Despite its minimalistic design, SIMBA achieves performance competitive with SOTA detectors. Temporal jittering effectively mitigates the silence shortcut. The proposed evaluation protocols reveal challenges in cross-manipulation and cross-dataset generalization, highlighting the need for more diverse and robust datasets.
Approach
The authors propose SIMBA, a minimalistic multimodal model with separate audio and video encoders combined via late fusion. They investigate design choices such as temporal sampling and augmentation strategies to improve robustness and generalization, in particular to address the "silence shortcut" issue in existing datasets (a minimal sketch of both ideas follows below).
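
The following is a minimal PyTorch-style sketch of what a late-fusion audio-video baseline and a jittered clip sampler might look like. The module structure, dimensions, and the helpers LateFusionBaseline and sample_jittered_clip are illustrative assumptions for clarity, not the authors' released implementation.

    # Minimal sketch: late fusion of separate audio/video encoders plus a
    # temporal-jittering clip sampler. All names and sizes are assumptions.
    import random
    import torch
    import torch.nn as nn

    class LateFusionBaseline(nn.Module):
        """Separate audio and video encoders whose clip-level embeddings are
        concatenated and passed to a small classification head (late fusion)."""

        def __init__(self, video_dim=512, audio_dim=512, hidden_dim=256):
            super().__init__()
            # Placeholder encoders; a real model would use e.g. a 3D CNN for
            # video frames and a spectrogram CNN for audio.
            self.video_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(video_dim), nn.ReLU())
            self.audio_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(audio_dim), nn.ReLU())
            self.head = nn.Sequential(
                nn.Linear(video_dim + audio_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),  # single real-vs-fake logit
            )

        def forward(self, video_clip, audio_clip):
            v = self.video_encoder(video_clip)   # (B, video_dim)
            a = self.audio_encoder(audio_clip)   # (B, audio_dim)
            return self.head(torch.cat([v, a], dim=-1))

    def sample_jittered_clip(frames, clip_len=16, jitter=4):
        """Temporal jittering (assumed form): randomize the clip start offset so
        windows are not always aligned to the same (e.g. silent) segments."""
        max_start = max(len(frames) - clip_len, 0)
        start = random.randint(0, max_start)
        start = min(max(start + random.randint(-jitter, jitter), 0), max_start)
        return frames[start:start + clip_len]

    # Illustrative usage with dummy tensors:
    # model = LateFusionBaseline()
    # video = torch.randn(2, 16, 3, 112, 112)   # (batch, frames, C, H, W)
    # audio = torch.randn(2, 1, 64, 100)        # (batch, 1, mel bins, time)
    # logits = model(video, audio)

Under this assumed setup, sampling the audio and video windows with independent random offsets means the detector cannot rely on fixed silent segments as a shortcut cue, which is the intuition behind the jittering mitigation described in the paper.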
Datasets
FakeAVCeleb, DeepSpeak v1
Model(s)
SIMBA (Simple Multimodal Baseline), LipForensics, RealForensics, AVoiD-DF, AVAD, AVFF
Author countries
Germany