Exploring the Role of Audio in Multimodal Misinformation Detection

Authors: Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Guanjun Li

Published: 2024-08-22 17:17:43+00:00

AI Summary

This paper proposes a multimodal misinformation detection framework that fuses audio, video, and text modalities to identify fake content on social media, where deepfake technology (especially audio deepfakes) makes detection harder. It investigates the importance of the audio modality and finds that careful alignment of audio and video features is crucial for good performance.

Abstract

With the rapid development of deepfake technology, especially audio deepfake technology, misinformation detection on social media faces a great challenge. Social media data often contains multimodal information, including audio, video, text, and images. However, existing multimodal misinformation detection methods tend to focus only on some of these modalities and fail to comprehensively address information from all of them. To comprehensively handle the various modalities that may appear on social media, this paper constructs a comprehensive multimodal misinformation detection framework. By employing a corresponding neural network encoder for each modality, the framework fuses the different modality information and supports the multimodal misinformation detection task. Based on the constructed framework, this paper explores the importance of the audio modality in multimodal misinformation detection on social media. By adjusting the architecture of the acoustic encoder, the effectiveness of different acoustic feature encoders in the multimodal misinformation detection task is investigated. Furthermore, this paper finds that audio and video information must be carefully aligned; otherwise, misalignment between the audio and video modalities can severely impair model performance.


Key findings
Including audio features significantly improves misinformation detection accuracy. The wav2vec2.0 audio encoder outperforms the VGG encoder. Misalignment between audio and video modalities negatively impacts model performance.
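To make the encoder comparison concrete, below is a minimal sketch of extracting frame-level audio features with wav2vec2.0 via HuggingFace Transformers. The specific checkpoint ("facebook/wav2vec2-base-960h"), the 16 kHz sampling rate, and the lack of pooling are assumptions for illustration; the paper does not specify these implementation details.

```python
# Hedged sketch: frame-level audio features from wav2vec2.0.
# Checkpoint and sampling rate are assumptions, not taken from the paper.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
encoder.eval()

def encode_audio(waveform, sampling_rate=16000):
    """Return a (seq_len, hidden_size) matrix of wav2vec2.0 frame features."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_values).last_hidden_state  # (1, seq_len, 768)
    return hidden.squeeze(0)
```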
Approach
The authors construct a multimodal framework using modality-specific encoders (BERT for text, VGG19 and C3D for video, VGG and wav2vec2.0 for audio) and cross-attention layers to fuse features. A transformer layer processes the fused features before classification.
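The following is a minimal PyTorch sketch of the fusion scheme described above: cross-attention lets one modality stream attend to the others, and a transformer layer processes the fused sequence before classification. The hidden size, head count, choice of text as the query stream, and mean pooling are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch of cross-attention fusion over text, video, and audio features,
# followed by a transformer layer and a binary real/fake classifier.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Text queries attend to video and audio key/value streams (assumed layout).
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fusion_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)  # real vs. fake logits

    def forward(self, text_feat, video_feat, audio_feat):
        # text_feat: (B, Lt, dim), video_feat: (B, Lv, dim), audio_feat: (B, La, dim)
        tv, _ = self.text_to_video(text_feat, video_feat, video_feat)
        ta, _ = self.text_to_audio(text_feat, audio_feat, audio_feat)
        fused = self.fusion_layer(tv + ta)   # (B, Lt, dim)
        pooled = fused.mean(dim=1)           # simple mean pooling over the sequence
        return self.classifier(pooled)       # (B, 2)
```

In such a design, badly aligned audio and video sequences give the cross-attention layers inconsistent key/value context, which is one plausible reading of why the paper finds misalignment so damaging.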
Datasets
FakeSV dataset
Model(s)
BERT, VGG19, C3D, VGG, wav2vec2.0
Author countries
China