Shallow- and Deep-fake Image Manipulation Localization Using Vision Mamba and Guided Graph Neural Network

Authors: Junbin Zhang, Hamid Reza Tohidypour, Yixiao Wang, Panos Nasiopoulos

Published: 2026-01-05 21:38:50+00:00

Comment: Under review for journal publication

AI Summary

This paper introduces a deep learning solution for localizing manipulations in both shallowfake and deepfake images. The approach leverages a Vision Mamba network for robust feature extraction, yielding feature maps that delineate the boundaries of tampered regions, and proposes a novel Guided Graph Neural Network (G-GNN) module to sharpen the distinction between authentic and manipulated pixels. The method achieves higher inference accuracy than other state-of-the-art techniques.

Abstract

Image manipulation localization is a critical research task, given that forged images can have a significant societal impact across various domains. Such image manipulations can be produced using traditional image editing tools (known as shallowfakes) or advanced artificial intelligence techniques (deepfakes). While numerous studies have focused on manipulation localization in either shallowfake images or deepfake videos, few approaches address both cases. In this paper, we explore the feasibility of using a single deep learning network to localize manipulations in both shallow- and deep-fake images, and propose a solution for this purpose. To precisely differentiate between authentic and manipulated pixels, we leverage the Vision Mamba network to extract feature maps that clearly describe the boundaries between tampered and untouched regions. To further enhance this separation, we propose a novel Guided Graph Neural Network (G-GNN) module that amplifies the distinction between manipulated and authentic pixels. Our evaluation results show that the proposed method achieves higher inference accuracy than other state-of-the-art methods.


Key findings
The proposed method achieved superior pixel-level and image-level F1 scores and AUC on both shallowfake and deepfake image datasets, outperforming state-of-the-art solutions. Ablation studies confirmed that both the Vision Mamba backbone and the Guided Graph Neural Network significantly contribute to the performance gains. The approach also demonstrated robustness against common image distortions such as Gaussian noise, Gaussian blur, and JPEG compression.
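As a rough illustration of how such robustness tests are typically run, the sketch below applies the three distortions to an input image. The parameter values (noise sigma, blur kernel size, JPEG quality) are assumptions for the example, not settings reported by the authors.

```python
# Illustrative robustness-test distortions: Gaussian noise, Gaussian blur,
# and JPEG compression. Parameter defaults are assumptions, not the paper's.
import cv2
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    # Add zero-mean Gaussian noise and clip back to valid 8-bit range.
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def gaussian_blur(img: np.ndarray, ksize: int = 5) -> np.ndarray:
    # Blur with a square Gaussian kernel; sigma derived from ksize by OpenCV.
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def jpeg_compress(img: np.ndarray, quality: int = 75) -> np.ndarray:
    # Round-trip through an in-memory JPEG encode/decode at the given quality.
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    assert ok, "JPEG encoding failed"
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```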
Approach
The method frames image manipulation localization as a semantic segmentation task, using a UPerNet-based architecture. It employs two Vision Mamba (VSSD) networks as backbones for multi-level feature extraction, capitalizing on their large receptive fields. A novel Guided Graph Neural Network (G-GNN) module is integrated into the Feature Pyramid Network (FPN) to refine manipulation boundaries by guiding graph construction with ground-truth masks during training.
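For intuition, here is a hypothetical PyTorch sketch of how such a dual-backbone UPerNet pipeline could be wired together. The injected modules, the additive feature fusion, and the placement of the G-GNN after the FPN are placeholders and simplifications, not the authors' implementation.

```python
# Hypothetical wiring of the described pipeline (a sketch, not the authors' code).
import torch
import torch.nn as nn

class ManipulationLocalizer(nn.Module):
    def __init__(self, rgb_backbone, noise_backbone, bayar_conv,
                 ppm, fpn, ggnn, num_classes: int = 2, feat_dim: int = 256):
        super().__init__()
        self.rgb_backbone = rgb_backbone      # Vision Mamba (VSSD) on the RGB stream
        self.bayar_conv = bayar_conv          # constrained conv producing noise residuals
        self.noise_backbone = noise_backbone  # second VSSD on the noise stream
        self.ppm = ppm                        # Pyramid Pooling Module on the deepest level
        self.fpn = fpn                        # Feature Pyramid Network
        self.ggnn = ggnn                      # Guided GNN refinement (placement simplified)
        self.classifier = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, image: torch.Tensor, gt_mask: torch.Tensor | None = None):
        rgb_feats = self.rgb_backbone(image)                     # multi-level feature list
        noise_feats = self.noise_backbone(self.bayar_conv(image))
        fused = [r + n for r, n in zip(rgb_feats, noise_feats)]  # additive fusion is an assumption
        fused[-1] = self.ppm(fused[-1])                          # PPM on the top level, as in UPerNet
        pyramid = self.fpn(fused)                                # pyramid of decoded features
        refined = self.ggnn(pyramid, gt_mask)                    # GT mask guides graph building in training
        return self.classifier(refined)                          # per-pixel manipulation logits
```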
Datasets
CASIAv2, CASIAv1, Columbia, COVERAGE, and NIST16 (for shallowfakes), plus frames extracted from FaceForensics++ (the YouTube, Deepfakes, Face2Face, FaceSwap, and NeuralTextures videos, for deepfakes).
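Since FaceForensics++ is distributed as videos, evaluation on it requires a frame-extraction step. The snippet below is a minimal OpenCV sketch; the sampling stride and output layout are assumptions, as the exact extraction protocol is not specified here.

```python
# Minimal frame-extraction sketch for FaceForensics++ videos.
# The stride and file naming are assumptions, not the paper's protocol.
import pathlib
import cv2

def extract_frames(video_path: str, out_dir: str, stride: int = 10) -> int:
    cap = cv2.VideoCapture(video_path)
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video stream
            break
        if idx % stride == 0:           # keep every `stride`-th frame
            cv2.imwrite(str(out / f"frame_{idx:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```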
Model(s)
UPerNet (framework), Vision Mamba (VSSD variant, as backbone), Guided Graph Neural Network (G-GNN, extending ViG, the Vision GNN), BayarConv (for noise extraction), Pyramid Pooling Module (PPM) head, Feature Pyramid Network (FPN).
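Of these components, BayarConv is the constrained convolution of Bayar and Stamm, in which each kernel's center weight is fixed to -1 and the remaining weights are normalized to sum to 1, making the layer a learnable high-pass residual (noise) extractor. A minimal PyTorch sketch of that constraint follows; the exact normalization details used in this paper may differ.

```python
# Sketch of the BayarConv constraint (after Bayar & Stamm): center weight
# fixed to -1, off-center weights normalized to sum to 1, so the kernel
# extracts prediction-error (noise) residuals. Details are an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayarConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, ksize: int = 5):
        super().__init__()
        self.ksize = ksize
        # Learnable off-center weights only (k*k - 1 per kernel).
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, ksize * ksize - 1))

    def _constrained_kernel(self) -> torch.Tensor:
        # Normalize off-center weights to sum to 1, then insert -1 at the center.
        w = self.weight / self.weight.sum(dim=-1, keepdim=True)
        center = self.ksize * self.ksize // 2
        out_ch, in_ch, _ = w.shape
        full = torch.cat([w[..., :center],
                          torch.full((out_ch, in_ch, 1), -1.0, device=w.device),
                          w[..., center:]], dim=-1)
        return full.view(out_ch, in_ch, self.ksize, self.ksize)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply the re-constrained kernel on every forward pass.
        return F.conv2d(x, self._constrained_kernel(), padding=self.ksize // 2)
```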
Author countries
Canada