COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning

Authors: Shivangi Aneja, Chris Bregler, Matthias Nießner

Published: 2021-01-15 19:00:42+00:00

AI Summary

The paper introduces COSMOS, a self-supervised method for detecting out-of-context image and text pairs. It leverages image-text grounding to identify scenarios where unaltered images are used in misleading contexts, achieving 85% accuracy on a newly created dataset of 200K images and 450K captions.

Abstract

Despite the recent attention to DeepFakes, one of the most prevalent ways to mislead audiences on social media is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our key insight is to leverage the grounding of image with text to distinguish out-of-context scenarios that cannot be disambiguated with language alone. We propose a self-supervised training strategy where we only need a set of captioned images. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check if both captions correspond to the same object(s) in the image but are semantically different, which allows us to make fairly accurate out-of-context predictions. Our method achieves 85% out-of-context detection accuracy. To facilitate benchmarking of this task, we create a large-scale dataset of 200K images with 450K textual captions from a variety of news websites, blogs, and social media posts. The dataset and source code are publicly available at https://shivangi-aneja.github.io/projects/cosmos/.


Key findings
COSMOS achieves 85% accuracy in detecting out-of-context image-caption pairs. The method is trained in a self-supervised manner, requiring only captioned images. The accompanying large-scale dataset enables benchmarking and future research on this task.
Approach
COSMOS is trained with a self-supervised strategy that needs only captioned images: it learns to selectively align individual objects in an image with the accompanying caption. At test time, it compares how two captions ground to the same image; if both captions align with the same object(s) yet are semantically different, the pair is flagged as out-of-context.
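
The following is a minimal Python sketch of that test-time decision rule. It assumes two precomputed inputs per image: the bounding box each caption aligns with most strongly (from the trained image-text matching model) and a sentence-level similarity score for the two captions (e.g. from SBERT). The helper names and threshold values are illustrative assumptions, not the authors' exact implementation.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def is_out_of_context(box_caption1, box_caption2, caption_similarity,
                      iou_threshold=0.5, sim_threshold=0.5):
    """Flag an (image, caption1, caption2) triplet as out-of-context.

    box_caption1, box_caption2: image regions most strongly aligned with
        each caption by the matching model.
    caption_similarity: semantic similarity of the two captions.
    """
    same_object = iou(box_caption1, box_caption2) > iou_threshold  # both captions ground to the same region
    different_claim = caption_similarity < sim_threshold           # but make semantically different claims
    return same_object and different_claim

Under this rule, two similar captions that describe the same region are treated as paraphrases, whereas dissimilar captions grounded to the same object(s) are flagged as a likely out-of-context pair.
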
Datasets
A large-scale dataset of 200K images with 450K textual captions collected from news websites, blogs, and social media posts. A subset of 1,700 image-caption triplets was manually annotated for benchmarking.
Model(s)
Mask R-CNN for object detection, ResNet-50 as the backbone of the object encoder, Universal Sentence Encoder (USE) for caption embeddings, and SBERT for caption similarity. A custom image-text matching model is trained with a max-margin loss.
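
A minimal PyTorch sketch of a max-margin image-text matching objective of the kind described above. The margin value, the max-pooling over per-object scores, and the single randomly sampled negative caption are assumptions made for illustration; they may differ from the paper's exact formulation.

import torch
import torch.nn.functional as F

def max_margin_loss(object_embeds, pos_caption_embed, neg_caption_embed, margin=0.3):
    """Max-margin image-text matching loss (illustrative sketch).

    object_embeds:     (num_objects, dim) object features, e.g. the ResNet-50
                       object encoder applied to Mask R-CNN boxes.
    pos_caption_embed: (dim,) embedding of the caption paired with the image.
    neg_caption_embed: (dim,) embedding of a randomly sampled caption.
    """
    obj = F.normalize(object_embeds, dim=-1)
    pos = F.normalize(pos_caption_embed, dim=-1)
    neg = F.normalize(neg_caption_embed, dim=-1)
    # Score each caption against every object and keep the best-aligned object.
    pos_score = (obj @ pos).max()
    neg_score = (obj @ neg).max()
    # Require the matching caption to outscore the random one by at least `margin`.
    return F.relu(margin - pos_score + neg_score)

# Example usage with random tensors (embedding size chosen arbitrarily):
objects = torch.randn(10, 512)   # 10 detected objects
cap_pos = torch.randn(512)       # caption published with the image
cap_neg = torch.randn(512)       # random caption from another image
loss = max_margin_loss(objects, cap_pos, cap_neg)
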
Author countries
Germany, USA