A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization

Authors: Jingchun Lian, Lingyu Liu, Yaxiong Wang, Yujiao Wu, Li Zhu, Zhedong Zheng

Published: 2024-12-27 15:23:39+00:00

AI Summary

This research introduces MMTT, a new large-scale multimodal dataset comprising 128,303 image-text pairs that couple deepfake facial images with detailed textual annotations explaining the manipulated regions. The authors also propose ForgeryTalker, a novel architecture for concurrent forgery localization and interpretation, which achieves superior performance on the MMTT dataset.

Abstract

Image forgery localization, which centers on identifying tampered pixels within an image, has seen significant advancements. Traditional approaches often model this challenge as a variant of image segmentation, treating the binary segmentation of forged areas as the end product. We argue that the basic binary forgery mask is inadequate for explaining model predictions. It does not clarify why the model pinpoints certain areas, and it treats all forged pixels alike, making it hard to identify the most visibly manipulated regions. In this study, we mitigate these limitations by generating salient-region-focused interpretations for forged images. To support this, we craft a Multi-Modal Tampering Tracing (MMTT) dataset, comprising facial images manipulated using deepfake techniques and paired with manual, interpretable textual annotations. To harvest high-quality annotations, annotators are instructed to meticulously observe the manipulated images and articulate the typical characteristics of the forgery regions. Subsequently, we collect a dataset of 128,303 image-text pairs. Leveraging the MMTT dataset, we develop ForgeryTalker, an architecture designed for concurrent forgery localization and interpretation. ForgeryTalker first trains a forgery prompter network to identify the pivotal clues within the explanatory text. Subsequently, the region prompter is incorporated into a multimodal large language model, which is fine-tuned to achieve the dual goals of localization and interpretation. Extensive experiments conducted on the MMTT dataset verify the superior performance of our proposed model. The dataset, code, and pretrained checkpoints will be made publicly available to facilitate further research and ensure the reproducibility of our results.


Key findings
ForgeryTalker outperforms baseline models in both forgery localization (IoU) and interpretation generation (CIDEr, BLEU). Ablation studies confirm the contribution of each component, particularly the Forgery Prompter Network, in improving performance. The MMTT dataset, code, and pretrained checkpoints are publicly available.
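For reference, the localization metric cited above (IoU) compares a predicted binary forgery mask against the ground-truth mask. The snippet below is a minimal sketch of that computation in NumPy; the function name, threshold, and toy example are illustrative and not taken from the paper's released code.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5) -> float:
    """Intersection-over-Union between a predicted and a ground-truth forgery mask.

    Both inputs are HxW arrays; pred_mask may hold soft scores in [0, 1],
    while gt_mask is binary (0 = pristine pixel, 1 = forged pixel).
    """
    pred = pred_mask >= threshold
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 1.0

# Toy example: the prediction overlaps the forged region only partially.
pred = np.zeros((4, 4)); pred[1:3, 1:4] = 0.9   # 6 predicted-forged pixels
gt = np.zeros((4, 4)); gt[1:3, 0:3] = 1.0       # 6 ground-truth forged pixels
print(f"IoU = {mask_iou(pred, gt):.3f}")         # 4 overlapping / 8 in union = 0.500
```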
Approach
ForgeryTalker uses a Forgery Prompter Network to identify salient manipulated regions, a Mask Decoder to refine pixel-level predictions, and a multimodal large language model to generate interpretive reports explaining the detected forgeries. Training proceeds in two stages: the Forgery Prompter Network is trained first, and the mask decoder and language model are then jointly optimized, as sketched below.
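The following is a minimal structural sketch of that two-stage design, not the released implementation: a tiny CNN encoder and a transposed-convolution decoder stand in for the paper's Vision Transformer backbone and SAM-style two-way Transformer, all module names and dimensions are illustrative, and the multimodal language model is only indicated by a comment.

```python
import torch
import torch.nn as nn

class ForgeryPrompterNetwork(nn.Module):
    """Stage 1 (sketch): predict which region keywords (e.g. "eyes", "nose",
    "mouth") are likely manipulated, from pooled image tokens."""
    def __init__(self, feat_dim: int = 256, num_region_keywords: int = 16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_region_keywords),
        )

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, feat_dim) -> multi-label region-keyword logits (B, K)
        return self.head(image_tokens.mean(dim=1))

class ForgeryTalkerSketch(nn.Module):
    """Stage 2 (sketch): the region prompts condition both the mask decoder
    (localization) and, in the full model, the language model (interpretation)."""
    def __init__(self, feat_dim: int = 256, num_region_keywords: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(                     # stand-in for the ViT backbone
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.prompter = ForgeryPrompterNetwork(feat_dim, num_region_keywords)
        self.prompt_proj = nn.Linear(num_region_keywords, feat_dim)
        self.mask_decoder = nn.Sequential(                # stand-in for the SAM-style decoder
            nn.ConvTranspose2d(feat_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image)                        # (B, C, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)          # (B, N, C) patch tokens
        keyword_logits = self.prompter(tokens)             # (B, K) region-prompt logits
        prompt = self.prompt_proj(keyword_logits.sigmoid())  # (B, C) conditioning vector
        conditioned = feats + prompt[:, :, None, None]     # inject prompt into features
        mask_logits = self.mask_decoder(conditioned)       # (B, 1, H, W) forgery mask
        # In the full model, the prompt and image tokens would also be fed to a
        # multimodal large language model to generate the textual interpretation.
        return mask_logits, keyword_logits

model = ForgeryTalkerSketch()
mask_logits, keyword_logits = model(torch.randn(2, 3, 224, 224))
print(mask_logits.shape, keyword_logits.shape)  # (2, 1, 224, 224) and (2, 16)
```

In this reading of the two-stage scheme, the prompter would be trained first against keyword labels mined from the annotations, then frozen or jointly fine-tuned while the mask decoder and language model are optimized on localization and report-generation losses.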
Datasets
Multi-Modal Tampering Tracing (MMTT) dataset, CelebAMask-HQ, Flickr-Faces-HQ
Model(s)
ForgeryTalker (combines a Vision Transformer backbone, a Forgery Prompter Network, a Mask Decoder based on SAM's two-way Transformer, and a multimodal large language model)
Author countries
China, Australia, Macau