ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Authors: Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu

Published: 2025-03-18 16:55:07+00:00

AI Summary

This paper introduces ExDDV, the first dataset for explainable deepfake detection in video. It contains around 5.4K real and deepfake videos that are manually annotated with text descriptions and click localizations of artifacts. Experiments with vision-language models show that both text and click supervision are needed to build robust explainable deepfake detection models.

Abstract

The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.

Key findings

Fine-tuning VLMs on ExDDV yielded the most accurate explanations. Both text and click supervision are needed for jointly localizing and describing artifacts. A ViT-based click predictor accurately localized visual artifacts with a mean absolute error of only 12 pixels.
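
To make the 12-pixel figure concrete, here is a minimal sketch of how a mean absolute error between predicted and annotated clicks could be computed. The summary does not say how the error is aggregated, so averaging the absolute differences over both the x and y coordinates is an assumption, and the function name `click_mae` is hypothetical rather than taken from the ExDDV code.

```python
def click_mae(predicted, annotated):
    """Mean absolute error, in pixels, between two lists of (x, y) clicks.

    Assumption: the error is averaged over both coordinates of every click.
    """
    if not predicted or len(predicted) != len(annotated):
        raise ValueError("need two non-empty click lists of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, annotated):
        total += abs(px - ax) + abs(py - ay)
    # Two coordinates per click, so divide by 2 * number of clicks.
    return total / (2 * len(predicted))


# Illustrative values only; these are not results from the paper.
predicted_clicks = [(10, 20), (30, 40)]
annotated_clicks = [(12, 17), (30, 44)]
mae = click_mae(predicted_clicks, annotated_clicks)
```

Under this definition, an error of 12 pixels means that each predicted coordinate lies, on average, 12 pixels away from the annotated one.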

Approach

The authors evaluate several vision-language models (VLMs) on ExDDV, employing various fine-tuning and in-context learning strategies. They explore the impact of incorporating click annotations as a supplementary supervision signal using soft and hard masking techniques to improve localization and explanation accuracy.
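
The difference between soft and hard masking can be illustrated with a small sketch. It assumes a hard mask keeps only the pixels within a fixed radius of the annotated click, while a soft mask down-weights pixels smoothly with a Gaussian centered on the click. The function names, the `radius` and `sigma` parameters, and the Gaussian form are illustrative assumptions, not the authors' implementation.

```python
import math


def hard_mask(height, width, click, radius):
    """Binary mask: 1.0 within `radius` pixels of the click, 0.0 elsewhere."""
    cx, cy = click
    return [
        [1.0 if math.hypot(x - cx, y - cy) <= radius else 0.0
         for x in range(width)]
        for y in range(height)
    ]


def soft_mask(height, width, click, sigma):
    """Gaussian mask: 1.0 at the click, decaying smoothly with distance."""
    cx, cy = click
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(frame, mask):
    """Element-wise product of a grayscale frame and a mask of equal size."""
    return [[p * m for p, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]


# Toy 8x8 grayscale frame with a click at (x=3, y=4).
frame = [[1.0] * 8 for _ in range(8)]
click = (3, 4)
hard = apply_mask(frame, hard_mask(8, 8, click, radius=2))
soft = apply_mask(frame, soft_mask(8, 8, click, sigma=2.0))
```

With the hard mask, everything outside the radius is zeroed out, so the model sees only the clicked region. The soft mask keeps the surrounding context at reduced intensity, which may help when the artifact extends beyond the click location.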

Datasets

ExDDV (created by combining DeeperForensics, FaceForensics++, DeepFake Detection Challenge, and BioDeepAV datasets)

Model(s)

BLIP-2, Phi-3-Vision, LLaVA-1.5

Author countries

Romania
