Multimodal Graph Learning for Deepfake Detection


Authors: Zhiyuan Yan, Peng Sun, Yubo Lang, Shuo Du, Shanzhuo Zhang, Wei Wang, Lei Liu

Published: 2022-09-12 17:17:49+00:00

AI Summary

This paper proposes Multimodal Graph Learning (MGL), a novel deepfake detection framework that leverages spatial, frequency, temporal, and landmark features. MGL uses Graph Neural Networks (GNNs) at both frame and video levels to capture inconsistencies and fuse multimodal information for improved robustness and generalization.

Abstract

Existing deepfake detectors face several challenges in achieving robustness and generalization. One of the primary reasons is their limited ability to extract relevant information from forgery videos, especially in the presence of various artifacts such as spatial, frequency, temporal, and landmark mismatches. Current detectors rely either on pixel-level features, which are easily affected by unknown disturbances, or on facial landmarks, which do not provide sufficient information. Furthermore, most detectors cannot utilize information from multiple domains for detection, leading to limited effectiveness in identifying deepfake videos. To address these limitations, we propose a novel framework, Multimodal Graph Learning (MGL), that leverages information from multiple modalities using two GNNs and several multimodal fusion modules. At the frame level, we employ a bi-directional cross-modal transformer and an adaptive gating mechanism to combine the features from the spatial and frequency domains with the geometric-enhanced landmark features captured by a GNN. At the video level, we use a Graph Attention Network (GAT) to represent each frame in a video as a node in a graph and encode temporal information into the edges of the graph to extract temporal inconsistency between frames. Our proposed method aims to effectively identify and utilize distinguishing features for deepfake detection. We evaluate the effectiveness of our method through extensive experiments on widely used benchmarks and demonstrate that our method outperforms the state-of-the-art detectors in terms of generalization ability and robustness against unknown disturbances.

Key findings

The proposed MGL method outperforms state-of-the-art deepfake detectors in terms of generalization and robustness. Experiments demonstrate the effectiveness of multimodal feature fusion and the use of GNNs at both frame and video levels for deepfake detection.

Approach

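The approach is a multimodal graph learning framework built on two GNNs. At the frame level, a GNN extracts geometric-enhanced landmark features, which a bi-directional cross-modal transformer and an adaptive gating mechanism fuse with spatial and frequency features. At the video level, a Graph Attention Network (GAT) treats each frame as a node in a graph and encodes temporal information in the edges to capture temporal inconsistencies between frames.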
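To make the adaptive gating idea from the abstract more concrete, here is a minimal sketch of a gated fusion of two feature vectors. It is not the authors' implementation: the function name `gated_fusion`, the per-dimension sigmoid gate, and the parameters `gate_weights` and `gate_bias` are assumptions chosen for illustration, and the paper's actual gate may be structured differently.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a learned gate.

    Hypothetical sketch: the gate g_i is computed from the concatenation
    of both inputs, and the output is g_i * a_i + (1 - g_i) * b_i.

    feat_a, feat_b: lists of floats, length d (e.g. image-domain features
        and landmark features; the pairing is an assumption).
    gate_weights: d rows, each a list of 2*d floats.
    gate_bias: list of d floats.
    """
    d = len(feat_a)
    concat = feat_a + feat_b  # list concatenation, length 2*d
    fused = []
    for i in range(d):
        # Linear projection of the concatenated features for dimension i.
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        # Convex combination: the gate decides how much of each input to keep.
        fused.append(g * feat_a[i] + (1.0 - g) * feat_b[i])
    return fused
```

With all gate parameters at zero the gate is 0.5 everywhere and the result is a plain average of the two inputs; a trained gate would instead shift weight toward whichever modality is more informative for each dimension.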

Datasets

UNKNOWN

Model(s)

Modified Xception architecture, Graph Neural Networks (GNNs), Graph Attention Network (GAT), Bi-directional cross-modal transformer.
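As a rough illustration of how a Graph Attention Network can operate on video frames, the sketch below runs one single-head attention pass over a graph whose nodes are frames. Several choices here are assumptions rather than details from the paper: edges connect frames within a fixed temporal `window` (including self-loops), the learned linear transform of a full GAT layer is omitted, and the names `temporal_neighbors`, `graph_attention`, and `attn_vec` are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU, the nonlinearity used for GAT attention scores."""
    return x if x > 0 else slope * x


def temporal_neighbors(num_frames, window=1):
    """Connect each frame to frames at most `window` steps away.

    Assumption: this simple window rule stands in for the paper's
    temporal edge encoding. Self-loops are included.
    """
    return [
        [j for j in range(num_frames) if abs(i - j) <= window]
        for i in range(num_frames)
    ]


def graph_attention(frame_feats, attn_vec, window=1):
    """One single-head attention pass over frame nodes.

    frame_feats: list of n feature vectors, each of length d.
    attn_vec: list of 2*d floats scoring a concatenated (node, neighbor) pair.
    Returns n updated feature vectors of length d.
    """
    n = len(frame_feats)
    d = len(frame_feats[0])
    neighbors = temporal_neighbors(n, window)
    updated = []
    for i in range(n):
        # Unnormalized attention score for each neighbor j of node i.
        scores = [
            leaky_relu(
                sum(a * x for a, x in zip(attn_vec, frame_feats[i] + frame_feats[j]))
            )
            for j in neighbors[i]
        ]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of neighbor features.
        updated.append(
            [
                sum(alpha * frame_feats[j][k] for alpha, j in zip(alphas, neighbors[i]))
                for k in range(d)
            ]
        )
    return updated
```

With `attn_vec` set to zeros, every neighbor receives equal weight and each frame is simply averaged with its temporal neighbors; a trained attention vector would instead emphasize neighbors whose features disagree or agree in informative ways.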

Author countries

China, Hong Kong
