Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline

Authors: Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu

Published: 2025-11-17 14:49:57+00:00

AI Summary

This paper introduces Foresee, a training-free pipeline that leverages vanilla Multimodal Large Language Models (MLLMs) for image forgery detection and localization (IFDL). Foresee requires no additional training, enables lightweight inference, and improves both tamper localization accuracy and the richness of textual explanations over existing MLLM-based methods. It achieves this through a type-prior-driven strategy and a Flexible Feature Detector (FFD) module that specifically handles copy-move manipulations.

Abstract

With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. MLLMs now demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches consume considerable computational resources while failing to reveal the inherent generalization potential of vanilla MLLMs for this problem. Motivated by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.


Key findings

Foresee achieves superior localization accuracy and provides more comprehensive, accurate, and readable textual explanations compared to both vanilla MLLMs and other interpretable IFDL methods. It demonstrates stronger generalization capabilities across diverse tampering types (copy-move, splicing, deepfake, AIGC-based editing) and exhibits robust performance under common image degradations like JPEG compression and Gaussian noise.

Approach

Foresee is a training-free MLLM-based pipeline that decomposes the tampering detection process through a chain-of-thought paradigm. It first predicts the forgery type using a type-prior-driven strategy, which then guides the MLLM with task-specific prompts. A Flexible Feature Detector (FFD) module is integrated to enhance detection of copy-move manipulations, and finally, MLLM-guided inference combined with GroundingDINO and SAM generates both textual explanations and precise localization masks.
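The chain-of-thought control flow above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (their code is not yet released): every name here (predict_forgery_type, ffd_copy_move, TYPE_PROMPTS, the prompt wording, and the stub outputs) is a hypothetical placeholder, and the MLLM, GroundingDINO, and SAM calls are replaced with stubs that only show how the stages connect.

```python
# Hypothetical sketch of the Foresee pipeline stages; all names and
# prompts are illustrative, and model calls are stubbed out.

# Task-specific prompts keyed by the predicted forgery type (type prior).
TYPE_PROMPTS = {
    "copy-move": "Locate the region duplicated from elsewhere in this image.",
    "splicing": "Locate the region pasted in from a different image.",
    "removal": "Locate the region where content was erased and inpainted.",
    "local enhancement": "Locate the locally retouched or enhanced region.",
    "deepfake": "Locate the synthesized or swapped face region.",
    "aigc-editing": "Locate the region edited by a generative model.",
}


def predict_forgery_type(image):
    """Step 1: classify the manipulation type with the vanilla MLLM.

    Stubbed: a real implementation would send the image and a
    classification prompt to the MLLM and parse its answer.
    """
    return "copy-move"


def ffd_copy_move(image):
    """Flexible Feature Detector (FFD) stub for copy-move evidence.

    A real detector would match local features of the image against
    itself to find duplicated regions; here we return a fixed box.
    """
    return {"duplicated_region": (32, 32, 96, 96)}


def foresee(image):
    """Chain-of-thought pipeline: type prior -> prompt -> explanation/mask."""
    ftype = predict_forgery_type(image)      # type-prior-driven step
    prompt = TYPE_PROMPTS[ftype]             # task-specific prompt
    evidence = ffd_copy_move(image) if ftype == "copy-move" else {}
    # MLLM-guided inference (stub): explanation conditioned on the prompt.
    explanation = f"Suspected {ftype} forgery (prompt: {prompt!r})."
    # GroundingDINO + SAM would refine this evidence into a pixel-level
    # mask; here we pass the coarse box through unchanged.
    mask_box = evidence.get("duplicated_region")
    return {"type": ftype, "explanation": explanation, "mask_box": mask_box}
```

In this sketch the type prior does double duty, selecting both the prompt sent to the MLLM and whether the copy-move-specific FFD branch runs, which mirrors how the paper routes copy-move cases through a dedicated module.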
Datasets

CASIA1.0+, Columbia, IMD2020, Coverage, NIST16 (editing datasets); FaceApp (deepfake dataset); OpenForensics (AIGC-based editing dataset)
Model(s)

UNKNOWN
Author countries

China