Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection

Authors: Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, Chip Hong Chang

Published: 2025-03-19 03:20:03+00:00

AI Summary

This paper proposes a novel framework for deepfake detection using Large Vision-Language Models (LVLMs). The framework integrates a Knowledge-guided Forgery Detector and a Forgery Prompt Learner to enhance the LVLMs' ability to detect and localize forgeries, resulting in improved generalization and explainability.

Abstract

Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment of their knowledge and forensics patterns. To this end, we present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection. Our framework includes a Knowledge-guided Forgery Detector (KFD), a Forgery Prompt Learner (FPL), and a Large Language Model (LLM). The KFD is used to calculate correlations between image features and pristine/deepfake image description embeddings, enabling forgery classification and localization. The outputs of the KFD are subsequently processed by the Forgery Prompt Learner to construct fine-grained forgery prompt embeddings. These embeddings, along with visual and question prompt embeddings, are fed into the LLM to generate textual detection responses. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, DFDC, and DF40, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.
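To make the KFD's role concrete, here is a minimal, hypothetical sketch of the correlation step described in the abstract: an image feature is compared against pristine and deepfake description embeddings by cosine similarity, and the scaled similarities are softmaxed into class probabilities. The random tensors stand in for ImageBind-Huge features, and the function name and temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Stand-ins for ImageBind-Huge outputs: one image feature and two text
# description embeddings (pristine vs. deepfake) in a shared embedding space.
dim = 1024
image_feat = torch.randn(1, dim)     # encoded test image (placeholder)
pristine_emb = torch.randn(1, dim)   # embedding of a pristine-face description (placeholder)
deepfake_emb = torch.randn(1, dim)   # embedding of a deepfake-face description (placeholder)

def forgery_scores(img, real_txt, fake_txt, temperature=0.07):
    """Correlate the image feature with the two description embeddings and
    return softmax probabilities over (pristine, deepfake)."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(torch.cat([real_txt, fake_txt], dim=0), dim=-1)
    sims = img @ txt.T / temperature   # cosine similarities, temperature-scaled
    return sims.softmax(dim=-1)        # [p(pristine), p(deepfake)]

probs = forgery_scores(image_feat, pristine_emb, deepfake_emb)
print(f"pristine: {probs[0, 0]:.3f}, deepfake: {probs[0, 1]:.3f}")
```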

Key findings
The proposed framework outperforms state-of-the-art methods in generalization performance across multiple deepfake datasets. It also supports multi-turn dialogue, yielding explainable detection results, and remains robust even when trained on limited data.
Approach
The authors propose a three-component framework: a Knowledge-guided Forgery Detector (KFD) that aligns image features with textual descriptions of pristine and forged faces, a Forgery Prompt Learner (FPL) that converts the KFD's outputs into fine-grained forgery prompt embeddings, and a Large Language Model (LLM) that generates the textual detection response. The forgery prompt embeddings are fed into the LLM together with visual and question prompt embeddings; a minimal sketch of this prompt-assembly step is shown below.
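As a hypothetical sketch of the prompt-assembly step (not the authors' code), the snippet below models the FPL as a simple linear projection that maps a pooled KFD output into a small set of forgery prompt tokens, then concatenates them with placeholder visual and question prompt embeddings before they would be passed to the LLM. All dimensions, token counts, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: Vicuna-7B uses a 4096-d hidden size; the KFD feature
# size and the number of forgery prompt tokens here are illustrative only.
llm_dim, kfd_dim, n_forgery_tokens = 4096, 1024, 8

class ForgeryPromptLearner(nn.Module):
    """Sketch of an FPL: project KFD outputs (classification/localization cues)
    into learnable forgery prompt tokens consumable by the LLM."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(kfd_dim, n_forgery_tokens * llm_dim)

    def forward(self, kfd_feat):                   # (B, kfd_dim)
        b = kfd_feat.size(0)
        return self.proj(kfd_feat).view(b, n_forgery_tokens, llm_dim)

fpl = ForgeryPromptLearner()
kfd_feat = torch.randn(1, kfd_dim)                 # pooled KFD output (placeholder)
visual_prompts = torch.randn(1, 1, llm_dim)        # projected image token(s) (placeholder)
question_prompts = torch.randn(1, 12, llm_dim)     # embedded question, e.g. "Is this image a deepfake?"

forgery_prompts = fpl(kfd_feat)
llm_inputs = torch.cat([visual_prompts, forgery_prompts, question_prompts], dim=1)
# llm_inputs would be passed to the LLM (e.g. Vicuna-7B) via inputs_embeds
# to generate the textual detection response.
print(llm_inputs.shape)                            # (1, 1 + 8 + 12, 4096)
```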
Datasets
FF++, CDF2, DFD, DFDCP, DFDC, DF40
Model(s)
ImageBind-Huge (image and text encoder), Vicuna-7B (LLM), PandaGPT architecture
Author countries
China, Macau, Singapore