Detecting Out-of-Context Image-Caption Pairs in News: A Counter-Intuitive Method

Authors: Eivind Moholdt, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen

Published: 2023-08-31 10:16:59+00:00

AI Summary

This paper proposes a novel approach for detecting out-of-context (OOC) image-caption pairs in news by leveraging generative image models like DALL-E 2 and Stable Diffusion. Two new datasets containing 6800 synthetically generated images are presented to facilitate further research in cheapfake detection.

Abstract

The growth of misinformation and re-contextualized media in social media and news leads to an increasing need for fact-checking methods. Concurrently, advances in generative models make cheapfakes and deepfakes both easier to create and harder to detect. In this paper, we present a novel approach that turns generative image models to our advantage for detecting Out-of-Context (OOC) use of image-caption pairs in news. We present two new datasets with a total of 6800 images generated using two different generative models: (1) DALL-E 2 and (2) Stable Diffusion. We are confident that the method proposed in this paper can advance research on generative models in the field of cheapfake detection, and that the resulting datasets can be used to train and evaluate new models aimed at detecting cheapfakes. We run a preliminary qualitative and quantitative analysis to evaluate the performance of each image generation model for this task, and evaluate a handful of methods for computing image similarity.


Key findings
The study found that using only object encoders, particularly CLIP, for image similarity comparison yielded better accuracy and faster runtime than pipelines built on object detection models. The results show a correlation between human perception of image similarity and the model's predictions, indicating that the proposed method is effective at detecting OOC image-caption pairs. However, the method struggles to detect contradictions within the caption pairs.
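As a rough illustration of the encoder-only comparison described above, the sketch below embeds two images with CLIP and scores them with cosine similarity. This is a minimal sketch, assuming the Hugging Face `transformers` CLIP checkpoint; the helper names are illustrative and not taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint name is an assumption; the paper does not specify this setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(image: Image.Image) -> torch.Tensor:
    """Return an L2-normalized CLIP image embedding."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def image_similarity(a: Image.Image, b: Image.Image) -> float:
    """Cosine similarity between two images' CLIP embeddings (in [-1, 1])."""
    return float(clip_embed(a) @ clip_embed(b).T)

# Usage: compare an original news photo against a generated image.
# sim = image_similarity(Image.open("original.jpg"), Image.open("generated.png"))
```

Because the embeddings are L2-normalized, the dot product equals cosine similarity; the same pattern works with ResNet, DenseNet, or EfficientNet features by swapping the encoder.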
Approach
The approach uses generative image models (DALL-E 2 and Stable Diffusion) to generate images from each caption. It then compares the perceptual similarity between the generated images and the original image: features are extracted with object detection models and/or encoders (ResNet, DenseNet, EfficientNet, CLIP) and scored with cosine similarity, and the resulting similarity is used to predict whether the image-caption pair is OOC.
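A minimal end-to-end sketch of this pipeline follows, assuming an off-the-shelf Stable Diffusion checkpoint from `diffusers` and CLIP for the similarity step. The decision rule and threshold are illustrative placeholders, not values or logic reported in the paper.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generator (checkpoint name is an assumption, not from the paper).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# CLIP encoder for the similarity step.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(img: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP image embedding."""
    inputs = proc(images=img, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def caption_image_similarity(original: Image.Image, caption: str) -> float:
    """Generate an image from the caption and compare it to the original."""
    generated = pipe(caption).images[0]
    return float(embed(original) @ embed(generated).T)

# One plausible decision rule (our assumption): if the two captions' generated
# images match the original very differently, flag the pair as OOC.
original = Image.open("news_photo.jpg")  # placeholder path
s1 = caption_image_similarity(original, "caption one ...")
s2 = caption_image_similarity(original, "caption two ...")
THRESHOLD = 0.2  # illustrative placeholder, not a value reported in the paper
is_ooc = abs(s1 - s2) > THRESHOLD
print(f"sim1={s1:.3f} sim2={s2:.3f} ooc={is_ooc}")
```

In practice, DALL-E 2 would be queried through its API rather than a local pipeline, and any threshold would need to be tuned on labeled data such as COSMOS.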
Datasets
Two new datasets of 3400 images each, generated with DALL-E 2 and Stable Diffusion respectively, plus the COSMOS dataset.
Model(s)
DALL-E 2, Stable Diffusion, ResNet, DenseNet, EfficientNet, CLIP, YOLOv5, YOLOv7, Mask R-CNN.
Author countries
Norway