M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

Authors: Junke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong Han, Jingjing Chen, Ser-Nam Lim, Yu-Gang Jiang

Published: 2021-04-20 05:43:44+00:00

AI Summary

This paper introduces M2TR, a Multi-modal Multi-scale Transformer for Deepfake detection. M2TR leverages both RGB and frequency-domain information at multiple scales to identify local inconsistencies indicative of manipulation, outperforming state-of-the-art methods. The paper also introduces SR-DF, a new high-quality Deepfake dataset, to advance research in this area.

Abstract

The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain to complement RGB information through a carefully designed cross-modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.
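
The frequency stream is the easiest part to picture concretely. Below is a minimal, hypothetical PyTorch sketch of a DCT-based frequency representation; the function names and the 16-coefficient band cutoff are illustrative assumptions, not the paper's actual filter design.

    import math
    import torch

    def dct_matrix(n: int) -> torch.Tensor:
        """Orthonormal DCT-II basis matrix of size n x n."""
        k = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # frequency index
        i = torch.arange(n, dtype=torch.float32).unsqueeze(0)  # spatial index
        d = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
        d[0] /= math.sqrt(2.0)  # normalize the DC row
        return d

    def dct2(img: torch.Tensor) -> torch.Tensor:
        """2D DCT of each channel of a square image batch (B, C, N, N)."""
        d = dct_matrix(img.shape[-1]).to(img)
        return d @ img @ d.T  # broadcasts over batch and channel dims

    def idct2(coef: torch.Tensor) -> torch.Tensor:
        """Inverse 2D DCT (the basis is orthonormal, so its transpose inverts it)."""
        d = dct_matrix(coef.shape[-1]).to(coef)
        return d.T @ coef @ d

    # High-pass filter a batch of face crops: zero out the low-frequency
    # corner of the coefficient grid, then transform back.
    faces = torch.rand(2, 3, 64, 64)
    coeffs = dct2(faces)
    coeffs[..., :16, :16] = 0.0
    high_freq_residue = idct2(coeffs)

A representation like this exposes subtle blending artifacts that are hard to see in the RGB domain, which is the motivation for the paper's second stream.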


Key findings

M2TR achieves state-of-the-art performance on several Deepfake detection datasets, including FaceForensics++, Celeb-DF, and the newly introduced SR-DF. Frequency-domain analysis improves robustness to compression artifacts, and the SR-DF dataset proves more challenging than existing datasets, highlighting the need for more robust detection methods.

Approach

M2TR uses a two-stream architecture that processes RGB and frequency-domain features. Multi-scale transformers operate on patches of varying sizes to capture local inconsistencies, while a cross-modality fusion block combines information from the two streams. A multi-task learning objective that adds face mask prediction is used to mitigate overfitting; a structural sketch of this layout follows.
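
The sketch below, in PyTorch, is a simplified illustration under stated assumptions: the module names are hypothetical, and the concat-plus-1x1-conv fusion stands in for the paper's cross-modality fusion block. It is not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchAttention(nn.Module):
        """Self-attention over non-overlapping patch tokens at one patch size."""
        def __init__(self, dim: int, patch: int):
            super().__init__()
            self.patch = patch
            self.to_tokens = nn.Conv2d(dim, dim, kernel_size=patch, stride=patch)
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            t = self.to_tokens(x).flatten(2).transpose(1, 2)       # (B, N, C)
            t, _ = self.attn(t, t, t)                              # token mixing
            t = t.transpose(1, 2).reshape(b, c, h // self.patch, w // self.patch)
            return F.interpolate(t, size=(h, w), mode="nearest")   # back to grid

    class TwoStreamSketch(nn.Module):
        """Hypothetical skeleton: RGB + frequency stems, attention at several
        patch sizes, simple fusion, and a multi-task head pair."""
        def __init__(self, dim: int = 64, patch_sizes=(2, 4, 8)):
            super().__init__()
            self.rgb_stem = nn.Conv2d(3, dim, 3, padding=1)
            self.freq_stem = nn.Conv2d(3, dim, 3, padding=1)
            self.scales = nn.ModuleList(PatchAttention(dim, p) for p in patch_sizes)
            self.fuse = nn.Conv2d(dim * (len(patch_sizes) + 1), dim, 1)
            self.cls_head = nn.Linear(dim, 2)       # real vs. fake logits
            self.mask_head = nn.Conv2d(dim, 1, 1)   # auxiliary face-mask map

        def forward(self, rgb: torch.Tensor, freq: torch.Tensor):
            r = self.rgb_stem(rgb)
            f = self.freq_stem(freq)
            multi = [scale(r) for scale in self.scales]        # per-scale maps
            fused = self.fuse(torch.cat(multi + [f], dim=1))   # naive fusion
            logits = self.cls_head(fused.mean(dim=(2, 3)))     # global pooling
            return logits, self.mask_head(fused)

    # Random tensors stand in for an RGB crop and its frequency representation.
    logits, mask = TwoStreamSketch()(torch.rand(2, 3, 64, 64),
                                     torch.rand(2, 3, 64, 64))

Training such a model would add a per-pixel loss on the predicted mask to the classification loss, which is the multi-task idea the summary refers to.
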
Datasets

FaceForensics++, Celeb-DF, SR-DF (introduced in this paper), ForgeryNet

Model(s)

Multi-modal Multi-scale Transformer (M2TR), with EfficientNet-b4 and various other models for comparison (e.g., Xception, MesoNet, F3-Net, MaDD)

Author countries

China