Training-Free Multimodal Deepfake Detection via Graph Reasoning

Authors: Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Yanyan Wei, Zhangling Duan, Zhaohong Jia

Published: 2025-09-26 02:22:12+00:00

AI Summary

The paper proposes GASP-ICL, a training-free framework for Multimodal Deepfake Detection (MDD) that utilizes Large Vision-Language Models (LVLMs) via In-Context Learning (ICL). GASP-ICL employs an MDD-adapted feature extractor and the Graph-Structured Taylor Adaptive Scorer (GSTAS) to retrieve highly discriminative, task-relevant image-text exemplars. This approach aims to capture subtle forgery cues and cross-modal inconsistencies without requiring LVLM fine-tuning.
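
A minimal sketch (not from the paper) of how retrieved image-text exemplars might be assembled into an in-context prompt for a frozen LVLM. The Exemplar fields, the chat-style message format, and the build_icl_prompt helper are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    image_path: str  # path to the exemplar image
    caption: str     # accompanying text claim
    label: str       # "real" or "fake" ground-truth annotation


def build_icl_prompt(exemplars: List[Exemplar],
                     query_image_path: str,
                     query_caption: str) -> List[dict]:
    """Pack labeled exemplars first, then the unlabeled query, into a
    chat-style message list for a frozen LVLM (no fine-tuning involved)."""
    question = ("Caption: {}\n"
                "Is this image-text pair manipulated? Answer 'real' or 'fake'.")
    messages = []
    for ex in exemplars:
        # Each exemplar is a (question, answer) turn the model can imitate.
        messages.append({
            "role": "user",
            "content": [{"type": "image", "path": ex.image_path},
                        {"type": "text", "text": question.format(ex.caption)}],
        })
        messages.append({"role": "assistant", "content": ex.label})
    # The query sample comes last; the LVLM answers it using the in-context examples.
    messages.append({
        "role": "user",
        "content": [{"type": "image", "path": query_image_path},
                    {"type": "text", "text": question.format(query_caption)}],
    })
    return messages
```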

Abstract

Multimodal deepfake detection (MDD) aims to uncover manipulations across visual, textual, and auditory modalities, thereby reinforcing the reliability of modern information systems. Although large vision-language models (LVLMs) exhibit strong multimodal reasoning, their effectiveness in MDD is limited by challenges in capturing subtle forgery cues, resolving cross-modal inconsistencies, and performing task-aligned retrieval. To this end, we propose Guided Adaptive Scorer and Propagation In-Context Learning (GASP-ICL), a training-free framework for MDD. GASP-ICL employs a pipeline to preserve semantic relevance while injecting task-aware knowledge into LVLMs. We leverage an MDD-adapted feature extractor to retrieve aligned image-text pairs and build a candidate set. We further design the Graph-Structured Taylor Adaptive Scorer (GSTAS) to capture cross-sample relations and propagate query-aligned signals, producing discriminative exemplars. This enables precise selection of semantically aligned, task-relevant demonstrations, enhancing LVLMs for robust MDD. Experiments on four forgery types show that GASP-ICL surpasses strong baselines, delivering gains without LVLM fine-tuning.


Key findings
GASP-ICL consistently improved detection performance across seven different LVLMs compared to vanilla zero-shot inference, demonstrating its effectiveness in providing task-aware guidance. The best results were achieved using a three-shot setting (k2=3) and a task-adapted CLIP feature extractor, highlighting the importance of optimizing both sample selection and feature alignment for robust MDD.
Approach
The training-free framework first leverages a fine-tuned CLIP encoder to embed the query sample and all candidates into a joint multimodal space, retrieving a coarse candidate set of size k1 by similarity. Next, the Graph-Structured Taylor Adaptive Scorer (GSTAS) builds a fused graph over these candidates to model cross-sample relations and propagate query-aligned signals via Taylor gating. The resulting top-k2 discriminative exemplars are then supplied as in-context demonstrations that guide the frozen LVLM to the final binary (real vs. fake) classification.
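
A hedged NumPy sketch of the retrieval-and-scoring pipeline described above. The cosine retrieval, the first-order Taylor gate (1 + similarity as a surrogate for exp), and the number of propagation steps are assumptions for illustration; the paper's exact GSTAS formulation is not reproduced here.

```python
import numpy as np


def retrieve_coarse(query_emb: np.ndarray, cand_embs: np.ndarray, k1: int) -> np.ndarray:
    """Step 1: cosine-similarity retrieval of k1 coarse candidates in CLIP space."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k1]


def gstas_like_topk(query_emb: np.ndarray, cand_embs: np.ndarray, k2: int,
                    steps: int = 2) -> np.ndarray:
    """Steps 2-3: build a fused graph over the candidates, propagate
    query-aligned signals with a Taylor-style gate, keep the top-k2 exemplars."""
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)

    affinity = c @ c.T                     # cross-sample relations among candidates
    gate = 1.0 + (c @ q)                   # first-order Taylor surrogate of exp(sim) >= 0
    adj = affinity * gate[:, None]         # query-aligned, gated adjacency
    adj = adj / (adj.sum(axis=1, keepdims=True) + 1e-8)  # row-normalise the graph

    scores = c @ q                         # initial query similarity per candidate
    for _ in range(steps):                 # propagate scores over the graph
        scores = adj @ scores
    return np.argsort(-scores)[:k2]        # indices of the retained exemplars
```

The retained indices would then select the image-text pairs that are formatted as in-context demonstrations (as in the prompt sketch above) before querying the frozen LVLM.
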
Datasets
DGM4
Model(s)
LVLMs (Qwen2.5-VL, InternVL3, Gemma-3, LLaVA-v1.6, Janus-Pro, Owl2.1, Kimi-VL), CLIP
Author countries
China