ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks

Authors: Ahmad ALBarqawi, Mahmoud Nazzal, Issa Khalil, Abdallah Khreishah, NhatHai Phan

Published: 2025-07-24 02:04:58+00:00

AI Summary

ViGText is a novel deepfake image detection approach that integrates image patches with Vision Large Language Model (VLLM) explanations within a graph-based framework using Graph Neural Networks (GNNs). This approach significantly improves generalization and robustness against adversarial attacks compared to existing methods.

Abstract

The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) Text explanations within a Graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through the use of multi-level feature extraction across spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy in detecting sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when it detects user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, reflecting the model's superior ability to generalize to unseen, fine-tuned variations of Stable Diffusion models. As for robustness, ViGText achieves an increase of 11.1% in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.
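
The abstract's "multi-level feature extraction across spatial and frequency domains" can be illustrated with a minimal patch-level sketch. The grid size, the use of torchvision's ConvNeXt-Large with its classification head removed for the spatial branch, and a per-patch 2-D FFT log-magnitude for the frequency branch are assumptions for illustration, not the paper's exact settings.

```python
import torch
from torchvision.models import convnext_large, ConvNeXt_Large_Weights
from PIL import Image

# Assumption: ConvNeXt-Large (final Linear replaced by Identity) provides
# spatial-domain patch embeddings; the frequency branch is an FFT log-magnitude.
weights = ConvNeXt_Large_Weights.DEFAULT
backbone = convnext_large(weights=weights)
backbone.classifier[2] = torch.nn.Identity()  # keep pooled features, drop class logits
backbone.eval()
preprocess = weights.transforms()

def extract_patch_features(image_path, grid=4):
    """Split the image into a grid x grid patch layout and return one spatial
    embedding and one frequency descriptor per patch (illustrative grid size)."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid
    spatial, frequency = [], []
    for row in range(grid):
        for col in range(grid):
            patch = img.crop((col * pw, row * ph, (col + 1) * pw, (row + 1) * ph))
            x = preprocess(patch).unsqueeze(0)
            with torch.no_grad():
                spatial.append(backbone(x).squeeze(0))          # spatial-domain embedding
            gray = x.mean(dim=1).squeeze(0)                      # grayscale version of the patch
            mag = torch.log1p(torch.abs(torch.fft.fft2(gray)))   # frequency-domain map
            frequency.append(mag.flatten())
    return torch.stack(spatial), torch.stack(frequency)
```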


Key findings
ViGText significantly outperforms state-of-the-art methods in deepfake image detection, achieving F1 scores up to 99.26% on the Stable Diffusion dataset and 99.6% on the StyleCLIP dataset. It shows superior generalization to unseen, fine-tuned variations of generative models and high robustness against adversarial attacks, with targeted attacks on its graph-based architecture degrading classification performance by less than 4%.
Approach
ViGText divides images into patches, creating image and text graphs based on patch features and VLLM-generated explanations. These graphs are integrated and analyzed using GNNs to detect deepfakes, leveraging both spatial and frequency domain features for enhanced robustness.
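A hedged sketch of the graph-and-GNN stage follows, written with PyTorch Geometric. The 4-connected grid adjacency between neighboring patches, the two-layer Graph Attention Network, and the way spatial, frequency, and text-explanation embeddings are concatenated into node features are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GATConv, global_mean_pool

def grid_edges(grid=4):
    """Bidirectional 4-connected adjacency over a grid x grid patch layout (assumed topology)."""
    edges = []
    for r in range(grid):
        for c in range(grid):
            i = r * grid + c
            if c + 1 < grid:
                edges += [(i, i + 1), (i + 1, i)]
            if r + 1 < grid:
                edges += [(i, i + grid), (i + grid, i)]
    return torch.tensor(edges, dtype=torch.long).t()

class PatchGAT(torch.nn.Module):
    """Two-layer Graph Attention Network over fused patch-node features."""
    def __init__(self, in_dim, hidden=128, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        self.head = torch.nn.Linear(hidden, 2)  # real vs. deepfake

    def forward(self, x, edge_index, batch):
        x = F.elu(self.gat1(x, edge_index))
        x = F.elu(self.gat2(x, edge_index))
        return self.head(global_mean_pool(x, batch))  # graph-level logits

# Example: concatenate spatial, frequency, and text-explanation embeddings per patch node
# (dimensions are placeholders; random tensors stand in for real features).
num_patches, spat_d, freq_d, text_d = 16, 1536, 256, 768
x = torch.cat([torch.randn(num_patches, spat_d),
               torch.randn(num_patches, freq_d),
               torch.randn(num_patches, text_d)], dim=1)
graph = Data(x=x, edge_index=grid_edges(4))
model = PatchGAT(in_dim=x.size(1))
logits = model(graph.x, graph.edge_index,
               torch.zeros(num_patches, dtype=torch.long))  # single-graph batch
```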
Datasets
Stable Diffusion dataset (with extensions for Stable Diffusion 3.5), StyleCLIP dataset
Model(s)
ConvNeXt-Large (feature extraction), Qwen2-VL-7B-Instruct (VLLM), Graph Attention Networks (GNNs)
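As an illustration of how the listed VLLM might supply the textual explanations, here is a minimal sketch that queries Qwen2-VL-7B-Instruct through the Hugging Face transformers interface. The prompt wording, the input file name, and the decoding settings are assumptions; this is one plausible way to obtain an explanation, not necessarily the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("suspect_image.png").convert("RGB")  # hypothetical input file
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Describe any visual artifacts or inconsistencies in this image "
                 "(textures, lighting, anatomy, text) that could indicate it was "
                 "AI-generated."},  # assumed prompt, not the paper's wording
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image],
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
explanation = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(explanation)  # explanation text would then be embedded as text-graph node features
```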
Author countries
USA, Qatar