ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Authors: Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu

Published: 2025-03-18 16:55:07+00:00

Comment: Accepted at WACV 2026

AI Summary

This paper introduces ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video, comprising approximately 5.4K real and deepfake videos. ExDDV features manual annotations with text descriptions of artifacts and precise click localizations. The research evaluates various vision-language models, demonstrating that combining both text and click supervision is essential for developing robust explainable models capable of localizing and describing deepfake artifacts.

Abstract

The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.


Key findings
The collected annotations in ExDDV are highly effective for training explainable AI models, with fine-tuned VLMs significantly outperforming pre-trained versions. Incorporating both text and click supervision leads to substantial performance gains, enabling models to accurately localize and describe visual artifacts. The ExDDV dataset provides sufficient data samples for training robust explainable deepfake detection models.
Approach
The authors address the need for explainable deepfake detection by introducing ExDDV, a novel dataset with rich manual annotations for video artifacts. They benchmark vision-language models (VLMs) using both fine-tuning and in-context learning strategies, and integrate click supervision through hard or soft masking of regions of interest to guide models towards precise artifact explanations.
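
To make the hard versus soft masking idea concrete, here is a minimal sketch in plain Python. It is not taken from the authors' code. The function names (click_mask, apply_mask), the disc-shaped hard region, the Gaussian soft weighting with sigma equal to the radius, and the single-channel frame stored as a list of rows are all assumptions chosen for illustration; the paper's actual masking may differ.

```python
import math


def click_mask(height, width, click_xy, radius, soft=False):
    """Build a height x width mask of weights in [0, 1] around a click.

    Hypothetical helper: click_xy is (x, y) in pixel coordinates.
    Hard mode keeps a disc of the given radius and zeroes everything else.
    Soft mode uses a Gaussian centred on the click with sigma = radius,
    so pixels far from the click are attenuated instead of removed.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    cx, cy = click_xy
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = (x - cx) ** 2 + (y - cy) ** 2  # squared distance to click
            if soft:
                row.append(math.exp(-d2 / (2.0 * radius ** 2)))
            else:
                row.append(1.0 if d2 <= radius ** 2 else 0.0)
        mask.append(row)
    return mask


def apply_mask(frame, mask):
    """Multiply a single-channel frame (list of rows) by the mask."""
    return [
        [pixel * weight for pixel, weight in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]
```

Under these assumptions, a hard mask discards all context outside the clicked region, while a soft mask keeps the surrounding pixels at reduced intensity, which lets a model still see some of the face around the artifact.
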
Datasets
ExDDV (their introduced dataset), DeeperForensics, FaceForensics++, DeepFake Detection Challenge (DFDC), BioDeepAV
Model(s)
BLIP-2, Phi-3-Vision, LLaVA-1.5 (for VLM evaluation); ViT, ResNet-50 (for click prediction); CLIP (ResNet backbone for k-NN retriever)
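
The k-NN retriever listed above can be illustrated with a short sketch of retrieval for in-context learning. This is an assumption-laden example, not the authors' implementation: the embeddings are taken as precomputed vectors (for instance from a CLIP image encoder, which is not called here), cosine similarity is an assumed choice of metric, and the names cosine, retrieve_examples and build_prompt, as well as the prompt wording, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query_emb, bank, k=3):
    """Return the explanations of the k bank items most similar to the query.

    bank is a list of (embedding, explanation) pairs built from annotated
    training videos (assumed to be embedded ahead of time).
    """
    ranked = sorted(bank, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(examples, question="Describe the artifacts in this video."):
    """Assemble a few-shot prompt; the wording is a placeholder."""
    shots = [f"Example {i}: {text}" for i, text in enumerate(examples, start=1)]
    return "\n".join(shots + [question])
```

A possible usage: embed the query frame, call retrieve_examples against the annotated bank, and pass the output of build_prompt to the vision-language model together with the query frames.
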
Author countries
Romania