Detecting Out-of-Context Image-Caption Pairs in News: A Counter-Intuitive Method

Authors: Eivind Moholdt, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen

Published: 2023-08-31 10:16:59+00:00

Comment: ACM International Conference on Content-Based Multimedia Indexing (CBMI '23)

AI Summary

This paper introduces a novel approach for detecting Out-of-Context (OOC) image-caption pairs in news by leveraging generative image models. The method involves generating synthetic images from captions and then comparing their perceptual similarity to identify cheapfakes. The authors also contribute two new datasets comprising images generated by DALL-E 2 and Stable Diffusion to facilitate further research in this area.

Abstract

The growth of misinformation and re-contextualized media in social media and news leads to an increasing need for fact-checking methods. Concurrently, the advancement in generative models makes cheapfakes and deepfakes both easier to make and harder to detect. In this paper, we present a novel approach using generative image models to our advantage for detecting Out-of-Context (OOC) use of image-caption pairs in news. We present two new datasets with a total of 6800 images generated using two different generative models: (1) DALL-E 2 and (2) Stable Diffusion. We are confident that the method proposed in this paper can further research on generative models in the field of cheapfake detection, and that the resulting datasets can be used to train and evaluate new models aimed at detecting cheapfakes. We run a preliminary qualitative and quantitative analysis to evaluate the performance of each image generation model for this task, and evaluate a handful of methods for computing image similarity.


Key findings
The study found that utilizing only object encoders for image similarity generally outperformed combining them with object detection models, offering better accuracy and runtime. The CLIP model achieved the best overall performance, with accuracy up to 68.2% on the DALL-E 2 generated dataset. While the performance difference between DALL-E 2 and Stable Diffusion was negligible, the method demonstrated limitations in capturing contradictions within caption pairs.
Approach
The proposed method generates synthetic images from news captions using DALL-E 2 and Stable Diffusion. Perceptual similarity between these generated images (or between generated and original images) is then computed using feature extraction techniques. Object encoders (e.g., CLIP, ResNet, EfficientNet), optionally combined with object detection models (e.g., Mask R-CNN, YOLO), produce feature vectors, and the cosine similarity between those vectors is used to classify image-caption pairs as OOC or Not-Out-of-Context (NOOC).
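
The final comparison step described above can be illustrated with a minimal sketch. It assumes the feature vectors have already been extracted by some encoder and are passed in as plain lists of floats. The function names, the threshold value of 0.5, and the rule that low similarity means OOC are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify_pair(
    feat_a: Sequence[float],
    feat_b: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> str:
    """Label a pair of image feature vectors as "OOC" or "NOOC".

    Assumption: images whose features are dissimilar (below the threshold)
    are treated as out-of-context; the paper may use a different rule.
    """
    similarity = cosine_similarity(feat_a, feat_b)
    return "NOOC" if similarity >= threshold else "OOC"


if __name__ == "__main__":
    # Toy vectors standing in for encoder outputs (e.g., from CLIP).
    print(classify_pair([1.0, 2.0], [2.0, 4.0]))  # same direction
    print(classify_pair([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

In practice the threshold would be tuned on labelled data such as COSMOS, and the vectors would come from one of the encoders listed above rather than from hand-written lists.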
Datasets
COSMOS dataset, custom datasets generated using DALL-E 2 and Stable Diffusion (totaling 6800 images).
Model(s)
UNKNOWN
Author countries
Norway