Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Authors: Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi

Published: 2020-12-08 20:30:43+00:00

AI Summary

The paper introduces the Edited Media Understanding (EMU) task, which requires models to answer open-ended questions about the intent and implications of image edits. The accompanying EMU dataset contains 48k question-answer pairs, and a new model, PELICAN, achieves promising results, though a significant gap to human performance remains.

Abstract

Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.


Key findings
PELICAN outperforms baseline models at understanding image edits, with human raters judging its answers accurate 40.35% of the time. However, a significant gap remains between machine and human performance (humans prefer human-annotated answers 93.56% of the time), highlighting the need for further research, especially in incorporating commonsense reasoning.
Approach
The authors propose PELICAN, a model that uses a multimodal transformer to process both the source and edited images. It incorporates importance embeddings derived from a topological sort of image regions, prioritizing regions that were altered or introduced by the edit, so that the model attends to the changes most relevant to a question about intent and implications; a sketch of this mechanism follows below.
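A minimal sketch of the importance-embedding idea, assuming PyTorch and approximating the paper's topological ordering with BFS hop distance from edited regions over a region graph. The names `importance_ranks`, `ImportanceEmbedding`, and all arguments are illustrative, not PELICAN's actual implementation.

```python
# Sketch only: approximates PELICAN-style importance embeddings.
# Regions closer to an edited region get a lower rank (0 = edited),
# and each rank maps to a learned embedding added to region features.
from collections import deque
import torch
import torch.nn as nn

def importance_ranks(num_regions, edges, edited_ids, max_rank=3):
    """BFS from edited regions over an (assumed) region adjacency graph.
    Rank 0 = edited region, rank k = k hops away, capped at max_rank."""
    adj = [[] for _ in range(num_regions)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    ranks = [max_rank] * num_regions
    for r in edited_ids:
        ranks[r] = 0
    queue = deque((r, 0) for r in edited_ids)
    while queue:
        node, d = queue.popleft()
        for nxt in adj[node]:
            if d + 1 < ranks[nxt]:
                ranks[nxt] = d + 1
                queue.append((nxt, d + 1))
    return torch.tensor(ranks)

class ImportanceEmbedding(nn.Module):
    """Adds a learned per-rank offset to each region feature before the
    multimodal transformer consumes the region sequence."""
    def __init__(self, hidden_size, max_rank=3):
        super().__init__()
        self.emb = nn.Embedding(max_rank + 1, hidden_size)

    def forward(self, region_feats, ranks):
        # region_feats: (num_regions, hidden_size); ranks: (num_regions,)
        return region_feats + self.emb(ranks)
```

The intended effect is that edited regions and their neighbors receive distinct learned offsets, biasing attention toward what actually changed between the source and edited image.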
Datasets
EMU dataset (48k question-answer pairs over 8k image pairs sourced from Reddit's r/photoshopbattles)
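For illustration only, one EMU example pairs a source image with its edit and attaches open-ended question-answer pairs written in natural language; the field names and values below are hypothetical placeholders, not the dataset's released schema.

```python
# Hypothetical record layout for one EMU example (all fields and values
# are illustrative placeholders; the actual dataset schema may differ).
example = {
    "source_image": "photoshopbattles/example_original.jpg",  # placeholder path
    "edited_image": "photoshopbattles/example_edited.jpg",    # placeholder path
    "qa_pairs": [
        {
            "question": "What was the intent behind this edit?",
            "answer": "A free-form natural-language answer about the edit's intent.",
        },
    ],
}
```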
Model(s)
PELICAN (a multimodal transformer with importance embeddings); baselines include GPT-2, Cross-Modality GPT-2, Dynamic Relational Attention, and VLP.
Author countries
USA