Detecting and Grounding Multi-Modal Media Manipulation and Beyond

Authors: Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, Ziwei Liu

Published: 2023-09-25 15:05:46+00:00

AI Summary

This paper introduces a new research problem, Detecting and Grounding Multi-Modal Media Manipulation (DGM^4), which requires both detecting and locating manipulated content in image-text pairs. The authors propose the HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) and an enhanced variant, HAMMER++, which outperform existing methods at both detecting and grounding manipulations.

Abstract

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and textual fake news detection methods have been proposed, they are designed only for single-modality forgery based on binary classification, and cannot analyze or reason about subtle forgery traces across modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims not only to detect the authenticity of multi-modal media, but also to ground the manipulated content, which requires deeper reasoning about multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate a Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model, HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++.
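The shallow manipulation reasoning stage aligns the two uni-modal encoders with a contrastive objective. The snippet below is a minimal PyTorch sketch of a standard symmetric InfoNCE-style image-text contrastive loss, shown only to illustrate the alignment step; the manipulation-aware weighting and local-view variant used by HAMMER/HAMMER++ are not reproduced here, and all function and tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style image-text contrastive loss (generic sketch).

    img_emb, txt_emb: (batch, dim) embeddings from the two uni-modal encoders.
    Matched pairs lie on the diagonal of the similarity matrix and are treated
    as positives; all other in-batch pairs act as negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```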


Key findings
HAMMER and HAMMER++ significantly outperform existing multi-modal learning and single-modality deepfake/text forgery detection methods on the DGM^4 dataset. The incorporation of local view contrastive learning further improves performance, highlighting the importance of fine-grained semantic alignment. The models demonstrate good generalization to unseen manipulations in external datasets.
Approach
The proposed HAMMER and HAMMER++ models use hierarchical manipulation reasoning. Shallow reasoning aligns image and text embeddings through contrastive learning, while deep reasoning uses cross-attention for multi-modal aggregation. Dedicated detection and grounding heads are integrated at both levels.
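As a rough illustration of this two-level design, the sketch below wires two uni-modal encoders into a cross-attention aggregator with separate detection and grounding heads. It is a minimal PyTorch outline under assumed dimensions and layer choices, not the authors' implementation (which builds on full vision and language transformer backbones); all class and head names are illustrative.

```python
import torch
import torch.nn as nn

class HierarchicalManipulationReasoner(nn.Module):
    """Minimal sketch of a two-level (shallow/deep) multi-modal reasoner."""

    def __init__(self, dim=256, num_heads=8, num_types=4):
        super().__init__()
        # Stand-ins for the uni-modal encoders (a vision transformer and a
        # text transformer in the actual paper).
        self.image_encoder = nn.Linear(dim, dim)
        self.text_encoder = nn.Linear(dim, dim)
        # Deep reasoning: modality-aware cross-attention aggregation.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dedicated heads: binary real/fake detection, manipulation-type
        # detection, image bounding-box grounding, per-token text grounding.
        self.binary_head = nn.Linear(dim, 2)
        self.type_head = nn.Linear(dim, num_types)
        self.bbox_head = nn.Linear(dim, 4)
        self.token_head = nn.Linear(dim, 2)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, dim), txt_tokens: (B, N_txt, dim)
        v = self.image_encoder(img_tokens)
        t = self.text_encoder(txt_tokens)
        # Text tokens attend over image tokens (deep manipulation reasoning).
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        cls = fused[:, 0]                             # [CLS]-like summary token
        return {
            "binary": self.binary_head(cls),          # real vs. manipulated
            "type": self.type_head(cls),              # manipulation types
            "bbox": self.bbox_head(v[:, 0]).sigmoid(),  # normalized image box
            "tokens": self.token_head(fused),         # per-token manipulated?
        }
```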
Datasets
A newly constructed DGM^4 dataset of image-text pairs manipulated using various techniques (face swap, face attribute manipulation, text swap, text attribute manipulation), with rich annotations for detection and grounding.
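For illustration, a single annotated pair in such a dataset might carry labels like the hypothetical record below: manipulation-type tags, a bounding box for the manipulated image region, and positions of manipulated text tokens. The field names are assumptions made for this sketch, not the dataset's exact schema.

```python
# Hypothetical annotation record for one manipulated image-text pair.
# Field names are illustrative, not the official DGM^4 schema.
sample = {
    "image": "images/000123.jpg",
    "text": "The senator smiles while announcing the new policy.",
    "fake_types": ["face_swap", "text_attribute"],  # applied manipulations
    "fake_image_box": [0.42, 0.18, 0.61, 0.47],     # normalized xyxy box
    "fake_text_pos": [2],                           # manipulated token indices
}

is_manipulated = len(sample["fake_types"]) > 0      # binary detection label
```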
Model(s)
HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) and its improved version HAMMER++, both based on transformer architectures.
Author countries
China, Singapore