Training-Free Multimodal Deepfake Detection via Graph Reasoning

Authors: Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Yanyan Wei, Zhangling Duan, Zhaohong Jia

Published: 2025-09-26 02:22:12+00:00

AI Summary

The paper proposes GASP-ICL, a training-free framework for multimodal deepfake detection (MDD) that enhances Large Vision-Language Models (LVLMs) via graph reasoning and in-context learning. It addresses the challenges of capturing subtle forgery cues and resolving cross-modal inconsistencies by adaptively selecting task-relevant demonstrations. GASP-ICL achieves significant performance improvements over strong baselines across various forgery types without requiring LVLM fine-tuning.

Abstract

Multimodal deepfake detection (MDD) aims to uncover manipulations across visual, textual, and auditory modalities, thereby reinforcing the reliability of modern information systems. Although large vision-language models (LVLMs) exhibit strong multimodal reasoning, their effectiveness in MDD is limited by challenges in capturing subtle forgery cues, resolving cross-modal inconsistencies, and performing task-aligned retrieval. To this end, we propose Guided Adaptive Scorer and Propagation In-Context Learning (GASP-ICL), a training-free framework for MDD. GASP-ICL employs a pipeline to preserve semantic relevance while injecting task-aware knowledge into LVLMs. We leverage an MDD-adapted feature extractor to retrieve aligned image-text pairs and build a candidate set. We further design the Graph-Structured Taylor Adaptive Scorer (GSTAS) to capture cross-sample relations and propagate query-aligned signals, producing discriminative exemplars. This enables precise selection of semantically aligned, task-relevant demonstrations, enhancing LVLMs for robust MDD. Experiments on four forgery types show that GASP-ICL surpasses strong baselines, delivering gains without LVLM fine-tuning.


Key findings
GASP-ICL consistently improves MDD performance across seven diverse Large Vision-Language Models (LVLMs) and four forgery types without requiring LVLM fine-tuning. Task-specific CLIP adaptation substantially enhances cross-modal alignment and improves sensitivity to subtle manipulations. The best performance was obtained with three in-context demonstrations (a three-shot setting) and a GSTAS propagation range factor of α=0.4.
Approach
GASP-ICL is a training-free framework that enhances LVLMs for MDD by constructing discriminative contexts through a structured exemplar selection pipeline. It first uses an MDD-adapted feature extractor (fine-tuned CLIP) to retrieve semantically aligned image-text pairs. Then, a Graph-Structured Taylor Adaptive Scorer (GSTAS) refines these candidates by modeling cross-sample relations and propagating query-aligned signals, providing highly discriminative exemplars for LVLM in-context learning.
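The two-stage selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the GSTAS formula, so the sketch assumes cosine-similarity retrieval over precomputed embeddings, a row-normalized candidate graph, and a truncated power series as a stand-in for the "Taylor" scorer. All function names, the candidate-set size k=8, and the series order are hypothetical; treating α as the per-hop propagation weight is likewise an assumption. Only the three-shot setting and α=0.4 come from the findings above.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_candidates(query, pool, k):
    """Stage 1 (assumed): indices of the k pool embeddings closest to the query by cosine similarity."""
    sims = l2_normalize(pool) @ l2_normalize(query)
    return np.argsort(-sims)[:k]

def propagate_scores(query, cands, alpha=0.4, order=3):
    """Stage 2 (assumed stand-in for GSTAS): smooth query similarity over a candidate graph.

    Computes (1 - alpha) * sum_{t=0..order} alpha^t * P^t * s0, where s0 is the
    query-candidate cosine similarity and P is a row-normalized similarity graph.
    """
    c = l2_normalize(cands)
    s0 = c @ l2_normalize(query)                 # initial query-aligned signal
    adj = np.clip(c @ c.T, 0.0, None)            # keep only non-negative edges
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)
    scores = np.zeros_like(s0)
    term = s0.copy()
    for t in range(order + 1):
        scores += (alpha ** t) * term            # add the t-hop contribution
        term = P @ term                          # propagate one more hop
    return (1.0 - alpha) * scores

def select_demonstrations(query, pool, k=8, shots=3, alpha=0.4):
    """Retrieve k candidates, rescore them on the graph, and keep the top `shots`."""
    cand_idx = retrieve_candidates(query, pool, k)
    scores = propagate_scores(query, pool[cand_idx], alpha=alpha)
    return cand_idx[np.argsort(-scores)[:shots]]

# Usage with random stand-in embeddings (real inputs would be CLIP features).
rng = np.random.default_rng(0)
pool = rng.normal(size=(50, 16))
query = pool[7] + 0.01 * rng.normal(size=16)
picked = select_demonstrations(query, pool)
print("demonstration indices:", picked)
```

The selected indices would then point at the image-text pairs placed in the LVLM prompt as in-context demonstrations. In this sketch, setting alpha to 0 reduces the scorer to plain similarity ranking, which makes the effect of propagation easy to compare.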
Datasets
DGM4
Model(s)
Large Vision-Language Models (LVLMs) such as Qwen2.5-VL-7B, InternVL3-8B, Gemma-3-12B, LLaVA-v1.6-7B, Janus-Pro-7B, Owl2.1-7B, and Kimi-VL-16B. CLIP encoders fine-tuned on DGM4 are used as feature extractors.
Author countries
China