Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning

Authors: Ye Zhu, Yunan Wang, Zitong Yu

Published: 2025-05-11 00:26:13+00:00

AI Summary

This paper introduces the MFND dataset, a large-scale multimodal fake news dataset with 11 manipulation types and detailed localization labels. It also proposes SDML, a shallow-deep multitask learning model that leverages unimodal and multimodal features for improved fake news detection and localization.

Abstract

Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulation types, designed to detect and localize highly realistic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully exploits unimodal and mutual-modal features to mine the intrinsic semantics of news. Under shallow inference, we propose momentum-distillation-based light-punishment contrastive learning for fine-grained alignment of image and text semantics in a uniform space, and an adaptive cross-modal fusion module to enhance the mutual-modal features. Under deep inference, we design a two-branch framework that augments the image and text unimodal features, merges each with the mutual-modal features, and produces four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Code and dataset are released at https://github.com/yunan-wang33/sdml.
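To make the shallow-inference idea concrete, below is a minimal sketch of a momentum-distillation-based image-text contrastive loss, assuming an ALBEF-style formulation in which soft targets from EMA (momentum) encoders temper the penalty on near-matching negatives (our reading of "light punishment"). The function name, the mixing weight `alpha`, and the temperature are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: image-text contrastive alignment with momentum distillation.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feat, txt_feat, img_feat_m, txt_feat_m,
                               temperature=0.07, alpha=0.4):
    """img_feat, txt_feat     : (B, D) embeddings from the online encoders.
       img_feat_m, txt_feat_m : (B, D) embeddings from the momentum (EMA) encoders.
       alpha                  : weight of the distilled soft targets (assumed)."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    img_m = F.normalize(img_feat_m, dim=-1)
    txt_m = F.normalize(txt_feat_m, dim=-1)

    # Similarity logits between every image and every caption in the batch.
    sim_i2t = img @ txt.t() / temperature
    sim_t2i = txt @ img.t() / temperature

    # Hard one-hot targets (diagonal pairs are positives), softened by the
    # momentum encoders' similarity distribution to lighten the penalty on
    # semantically close negatives.
    hard = torch.eye(img.size(0), device=img.device)
    soft_i2t = alpha * F.softmax(img_m @ txt_m.t() / temperature, dim=1) + (1 - alpha) * hard
    soft_t2i = alpha * F.softmax(txt_m @ img_m.t() / temperature, dim=1) + (1 - alpha) * hard

    loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * soft_i2t).sum(dim=1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * soft_t2i).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```

In the paper's pipeline, features aligned this way would then feed the adaptive cross-modal fusion module that produces the mutual-modal representation used downstream.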


Key findings
SDML achieves state-of-the-art performance on multiple datasets, outperforming baselines in both multi-task and single-task settings. Ablation studies confirm the contribution of each proposed module. The model effectively detects and localizes manipulated images and text across diverse deepfake scenarios.
Approach
The SDML model uses a shallow inference stage to align and fuse image and text features, followed by a deep inference stage with separate image and text branches that enhance the unimodal features and produce four predictions: (1) binary real/fake classification of the news, (2) image forgery detection, (3) image forgery localization, and (4) text forgery detection. A hedged sketch of this prediction stage follows.
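The sketch below illustrates the deep-inference stage under stated assumptions: each unimodal stream is concatenated with the fused (mutual-modal) features from the shallow stage and passed through its own branch before four task heads. Module names, feature dimensions, and the patch-level localization head are illustrative and are not taken from the authors' code.

```python
# Hedged sketch: two-branch deep inference with four prediction heads.
import torch
import torch.nn as nn

class DeepInferenceHeads(nn.Module):
    def __init__(self, dim=768, num_patches=196):
        super().__init__()
        # Branches that enhance each unimodal stream after merging it with the
        # mutual-modal (fused) features produced by the shallow stage.
        self.img_branch = nn.Sequential(nn.Linear(dim * 2, dim), nn.GELU())
        self.txt_branch = nn.Sequential(nn.Linear(dim * 2, dim), nn.GELU())
        # Four predictions: real/fake news classification, image forgery
        # detection, image forgery localization, and text forgery detection.
        self.news_cls = nn.Linear(dim, 2)
        self.img_det = nn.Linear(dim, 2)
        self.img_loc = nn.Linear(dim, num_patches)  # patch-level manipulation map (assumed granularity)
        self.txt_det = nn.Linear(dim, 2)

    def forward(self, img_feat, txt_feat, fused_feat):
        # img_feat, txt_feat, fused_feat: (B, dim) pooled features.
        img = self.img_branch(torch.cat([img_feat, fused_feat], dim=-1))
        txt = self.txt_branch(torch.cat([txt_feat, fused_feat], dim=-1))
        return {
            "news": self.news_cls(fused_feat),
            "image_detection": self.img_det(img),
            "image_localization": self.img_loc(img),
            "text_detection": self.txt_det(txt),
        }
```

In practice, each head would be trained with its own loss (cross-entropy for the detection heads, a segmentation-style loss for localization), summed into the multitask objective.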
Datasets
MFND dataset (created by authors), Twitter, Weibo, DGM4
Model(s)
Shallow-Deep Multitask Learning (SDML) model with a ViT-B/16 vision encoder, a BERT text encoder, and several custom modules: LPCL (light-punishment contrastive learning), ACMF (adaptive cross-modal fusion), MVE, and CA
Author countries
China