Detecting and Grounding Multi-Modal Media Manipulation

Authors: Rui Shao, Tianxing Wu, Ziwei Liu

Published: 2023-04-05 16:20:40+00:00

AI Summary

This paper introduces a novel research problem, Detecting and Grounding Multi-Modal Media Manipulation (DGM^4), focusing on detecting and locating manipulated content in image-text pairs. A new large-scale DGM^4 dataset is created, and a hierarchical multi-modal reasoning transformer (HAMMER) model is proposed to effectively address this problem.

Abstract

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are designed only for single-modality forgery based on binary classification, let alone analyzing and reasoning about subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of our model; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.


Key findings
HAMMER significantly outperforms baseline multi-modal learning methods (CLIP, ViLT), uni-modal deepfake detection methods (TS, MAT), and sequence-tagging methods (BERT, LUKE) across the evaluation metrics. Ablation studies confirm the importance of both the image and text modalities and the effectiveness of the proposed loss functions.
Approach
The HAMMER model uses a hierarchical approach. It first performs manipulation-aware contrastive learning between the image and text encoders as shallow manipulation reasoning. A multi-modal aggregator then applies modality-aware cross-attention as deep manipulation reasoning, with dedicated manipulation detection and grounding heads integrated at both levels; a rough sketch of the shallow stage follows below.
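As a rough illustration of the shallow-reasoning stage, the sketch below implements an InfoNCE-style image-text contrastive loss in which manipulated pairs do not contribute positives on the diagonal. The function name, temperature, and the exact handling of manipulated pairs are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def manipulation_aware_contrastive_loss(img_emb, txt_emb, is_manipulated, temperature=0.07):
    """Illustrative InfoNCE-style image-text contrastive loss.

    Only pristine pairs supply a positive on the diagonal; manipulated pairs
    are excluded from the positive set (the paper's exact weighting may differ).
    img_emb, txt_emb: (B, D) embeddings; is_manipulated: (B,) bool tensor.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix

    targets = torch.arange(logits.size(0), device=logits.device)
    keep = ~is_manipulated                            # rows/columns with a valid positive
    loss_i2t = F.cross_entropy(logits[keep], targets[keep])
    loss_t2i = F.cross_entropy(logits.t()[keep], targets[keep])
    return 0.5 * (loss_i2t + loss_t2i)
```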
Datasets
A new large-scale DGM^4 dataset was constructed from the VisualNews dataset, with image-text pairs manipulated by four techniques: face swap, face attribute manipulation, text swap, and text attribute manipulation. Each pair is annotated with a binary (real/fake) label, fine-grained manipulation types, manipulated-region bounding boxes, and manipulated text tokens.
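The record layout below is a hypothetical sketch of what a single DGM^4 annotation might contain, based on the description above; all field names and value formats are illustrative, not the dataset's actual keys.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DGM4Sample:
    """Hypothetical per-pair annotation schema for DGM^4 (illustrative only)."""
    image_path: str
    text: str
    is_fake: bool                                              # binary authenticity label
    manipulation_types: List[str] = field(default_factory=list)  # e.g. ["face_swap", "text_attribute"]
    fake_image_box: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2) of manipulated region
    fake_text_tokens: List[int] = field(default_factory=list)   # indices of manipulated tokens
```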
Model(s)
HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER), which combines ViT-B/16 for image encoding, 6-layer transformers (initialized from BERT-base) for text encoding and multi-modal aggregation, and dedicated detection and grounding heads.
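The skeleton below sketches how these components could fit together in PyTorch. The generic encoder stand-ins, head designs, and the four-way manipulation-type head are simplified assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HammerSketch(nn.Module):
    """Structural sketch of HAMMER's components (simplified, illustrative)."""

    def __init__(self, dim=768, num_types=4, num_token_labels=2):
        super().__init__()
        # Uni-modal encoders: ViT-B/16-like image encoder and a 6-layer
        # BERT-base-style text encoder, shown here as generic transformer stacks.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=12)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=6)
        # Multi-modal aggregator: text tokens cross-attend to image patch embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        # Detection heads (binary + fine-grained types) and grounding heads (box + tokens).
        self.binary_head = nn.Linear(dim, 2)
        self.type_head = nn.Linear(dim, num_types)
        self.bbox_head = nn.Linear(dim, 4)
        self.token_head = nn.Linear(dim, num_token_labels)

    def forward(self, img_patches, txt_tokens):
        v = self.image_encoder(img_patches)          # (B, Nv, D) image patch features
        t = self.text_encoder(txt_tokens)            # (B, Nt, D) text token features
        fused, _ = self.cross_attn(t, v, v)          # deep, modality-aware reasoning
        cls = fused[:, 0]                            # summary token of the fused sequence
        return {
            "binary": self.binary_head(cls),         # real vs. fake
            "types": self.type_head(cls),            # fine-grained manipulation types
            "bbox": self.bbox_head(v.mean(dim=1)),   # manipulated image region
            "tokens": self.token_head(fused),        # per-token manipulation tags
        }
```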
Author countries
China, Singapore