SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Authors: Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, Guangliang Cheng

Published: 2024-12-05 16:12:25+00:00

AI Summary

This paper introduces SID-Set, a large and diverse social media image deepfake dataset with 300K images, and SIDA, a framework built on large multimodal models that detects, localizes, and explains image manipulations.

Abstract

The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations; (2) broad diversity, encompassing fully synthetic and tampered images across various classes; and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Extensive experiments demonstrate that SIDA achieves superior performance across diverse settings compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks. The code, model, and dataset will be released.


Key findings
SIDA outperforms or matches state-of-the-art methods in deepfake detection and localization on SID-Set and other benchmarks. Its robustness to common image perturbations highlights its practical applicability, and it provides effective textual explanations for its predictions.
Approach
SIDA extends the vocabulary of a large multimodal model with special <DET> and <SEG> tokens whose hidden states carry detection and segmentation information. A detection head classifies the image from the <DET> token's embedding, a segmentation head decodes masks for tampered regions from the <SEG> token's embedding, and the LLM itself generates textual explanations of its judgment (see the sketch below).
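
To make the mechanism concrete, the sketch below shows how the hidden states at the <DET> and <SEG> positions can be pulled out and routed to two lightweight heads. It is a minimal illustration under stated assumptions, not the authors' implementation: the DetectionHead and SegmentationHead architectures, the dimensions, and the three-class split (real / synthetic / tampered, matching SID-Set's composition) are all hypothetical.

```python
# Minimal, illustrative sketch of the special-token mechanism (not the
# authors' released code). DetectionHead / SegmentationHead and all sizes
# are hypothetical; the three-way split mirrors SID-Set's real /
# synthetic / tampered composition.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Classifies the image from the <DET> token's last-layer hidden state."""
    def __init__(self, hidden_dim: int, num_classes: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, h_det: torch.Tensor) -> torch.Tensor:
        return self.mlp(h_det)

class SegmentationHead(nn.Module):
    """Projects the <SEG> token's hidden state into a prompt embedding
    for a SAM-style mask decoder (the decoder itself is omitted here)."""
    def __init__(self, hidden_dim: int, prompt_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, prompt_dim)

    def forward(self, h_seg: torch.Tensor) -> torch.Tensor:
        return self.proj(h_seg)

# Stand-ins for the LMM pipeline. With Hugging Face models one would add
# the tokens via tokenizer.add_tokens(["<DET>", "<SEG>"]) and call
# model.resize_token_embeddings(len(tokenizer)).
hidden_dim, vocab_size = 4096, 32000
det_id, seg_id = vocab_size, vocab_size + 1  # ids assigned to the new tokens

# Suppose the LMM produced these last-layer hidden states for a response
# containing one <DET> and one <SEG> token.
token_ids = torch.tensor([[1, 17, det_id, 9, seg_id, 2]])
hidden_states = torch.randn(1, token_ids.shape[1], hidden_dim)

h_det = hidden_states[token_ids == det_id]  # shape (1, hidden_dim)
h_seg = hidden_states[token_ids == seg_id]  # shape (1, hidden_dim)

logits = DetectionHead(hidden_dim)(h_det)     # authenticity prediction
prompt = SegmentationHead(hidden_dim)(h_seg)  # input to the mask decoder
print(logits.shape, prompt.shape)  # torch.Size([1, 3]) torch.Size([1, 256])
```

In LISA, on which SIDA's released models are fine-tuned, the <SEG> embedding prompts a SAM mask decoder in just this embedding-as-mask fashion; the <DET> detection branch is SIDA's addition.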
Datasets
SID-Set (300K images: 100K real, 100K synthetic, 100K tampered), OpenImages V7, Flickr30k, COCO, MagicBrush
Model(s)
LISA-7B-v1, LISA-13B-v1 (fine-tuned with LoRA), FLUX, Latent Diffusion, GPT-4o, Language-SAM
Author countries
UK, Singapore, China