Multi-Modal Semantic Inconsistency Detection in Social Media News Posts

Authors: Scott McCrae, Kehan Wang, Avideh Zakhor

Published: 2021-05-26 21:25:27+00:00

AI Summary

This paper presents a novel multi-modal classification architecture for detecting semantic inconsistencies between video appearance and text captions in social media news posts. It uses a fusion framework combining textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification to achieve 60.5% accuracy on randomly mismatched caption-video pairs, compared with below-50% accuracy for uni-modal approaches.

Abstract

As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts for analysis. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.


Key findings
The multi-modal approach achieved 60.5% accuracy in detecting semantic inconsistencies on randomly mismatched caption-video pairs, while uni-modal models remained below 50%, i.e., below chance on this balanced binary task. Ablation studies confirmed that fusion across modalities is necessary, with named entity verification (using both textual and visual methods) and semantic embeddings from video and text contributing most.
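The paper's code is not reproduced here, but a minimal sketch may clarify the textual side of named entity verification: extract named entities from the caption and from the audio transcript, and measure their overlap. The spaCy model, entity types, and matching rule below are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hypothetical named-entity consistency feature: fraction of caption
# entities that also appear in the audio transcript. Assumes spaCy
# with the en_core_web_sm model installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_overlap(caption: str, transcript: str) -> float:
    """Return the fraction of caption entities found in the transcript."""
    keep = {"PERSON", "ORG", "GPE"}  # people, organizations, places
    caption_ents = {e.text.lower() for e in nlp(caption).ents if e.label_ in keep}
    transcript_ents = {e.text.lower() for e in nlp(transcript).ents if e.label_ in keep}
    if not caption_ents:
        return 1.0  # no named entities to contradict
    return len(caption_ents & transcript_ents) / len(caption_ents)
```

A low overlap score suggests the caption names people or places that the video's audio never mentions, one signal the fusion model can weigh against the others.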
Approach
The authors propose a multi-modal fusion framework that leverages an ensemble of feature extractors: textual analysis of captions, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. The per-modality features are then fused and classified with a multi-layer perceptron and an LSTM, as sketched below.
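For concreteness, here is a minimal PyTorch sketch of late fusion: per-modality feature vectors are concatenated and passed through an MLP that predicts consistent vs. inconsistent. The feature dimensions, layer sizes, and the omission of the paper's LSTM branch are simplifying assumptions.

```python
# Minimal late-fusion classifier: concatenate per-modality features
# and classify with a small MLP. Dimensions are illustrative.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dims: dict, hidden: int = 512):
        super().__init__()
        fused = sum(dims.values())  # total width after concatenation
        self.keys = sorted(dims)    # fixed modality order
        self.mlp = nn.Sequential(
            nn.Linear(fused, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # consistent vs. inconsistent
        )

    def forward(self, feats: dict) -> torch.Tensor:
        x = torch.cat([feats[k] for k in self.keys], dim=-1)
        return self.mlp(x)

# Example: 768-d BERT caption embedding, 1024-d S3D video embedding,
# plus scalar entity- and face-consistency scores.
model = FusionClassifier({"text": 768, "video": 1024, "entity": 1, "face": 1})
```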
Datasets
A new dataset of 4,000 real-world Facebook news posts with videos, curated by the authors. Half of the posts are pristine caption-video pairs; in the other half, captions are randomly swapped between posts to create semantic inconsistencies. A possible construction is sketched below.
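A minimal sketch of how such random mismatches can be generated, assuming a simple 50/50 split with caption swapping inside the mismatched half; the authors' exact sampling procedure is not specified here.

```python
# Label pristine pairs 0 and swapped (mismatched) pairs 1.
import random

def make_pairs(posts, seed=0):
    """posts: list of (video_id, caption) tuples."""
    rng = random.Random(seed)
    posts = list(posts)
    rng.shuffle(posts)
    half = len(posts) // 2
    pristine = [(vid, cap, 0) for vid, cap in posts[:half]]
    pool = posts[half:]
    swapped = []
    for i, (vid, _) in enumerate(pool):
        # pick a caption from a different post in the pool (j != i)
        j = (i + 1 + rng.randrange(len(pool) - 1)) % len(pool)
        swapped.append((vid, pool[j][1], 1))
    return pristine + swapped
```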
Model(s)
BERT (for text processing), S3D (for video understanding), ResNet50 (for object detection), FaceNet (for facial verification), and a custom multi-layer perceptron and LSTM network for multi-modal fusion and classification.
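As an example of one of these feature extractors, the caption embedding can be computed with a pretrained BERT via Hugging Face transformers; using the [CLS] token as a 768-dimensional sentence representation is an assumption for illustration, not necessarily the paper's pooling choice.

```python
# Extract a 768-d caption embedding from pretrained BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def caption_embedding(caption: str) -> torch.Tensor:
    inputs = tokenizer(caption, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]  # [CLS] token, shape (1, 768)
```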
Author countries
USA