Spatiotemporal Inconsistency Learning for DeepFake Video Detection

Authors: Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, Lizhuang Ma

Published: 2021-09-04 13:05:37+00:00

Comment: To appear in ACM MM 2021

AI Summary

This paper addresses DeepFake video detection by proposing a Spatial-Temporal Inconsistency Learning (STIL) block designed to capture both spatial and temporal inconsistencies in forged videos. The STIL block integrates a Spatial Inconsistency Module (SIM) for intra-frame artifacts, a Temporal Inconsistency Module (TIM) that exploits temporal differences across adjacent frames, and an Information Supplement Module (ISM) to combine these features. This plug-and-play block can be integrated into existing 2D CNNs to enhance their ability to detect sophisticated deepfakes.

Abstract

The rapid development of facial manipulation techniques has raised public concern in recent years. Following the success of deep learning, existing methods typically formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To address this issue, we formulate this task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it into a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions. The ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation. Moreover, our STIL block is flexible and can be plugged into existing 2D CNNs. Extensive experiments and visualizations demonstrate the effectiveness of our method against state-of-the-art competitors.


Key findings
The proposed STIL method consistently outperforms state-of-the-art competitors on four widely used public benchmarks (FF++, Celeb-DF, DFDC, WildDeepfake), especially under challenging low-quality compression settings. It demonstrates superior generalization in cross-dataset evaluations, indicating robustness to unseen manipulation methods. Ablation studies confirm the effectiveness of each module, the chosen temporal-difference operation, and the unidirectional information flow from the spatial to the temporal module.
Approach
The authors propose a novel Spatial-Temporal Inconsistency Learning (STIL) block, which comprises a Spatial Inconsistency Module (SIM) for intra-frame forgery patterns, a Temporal Inconsistency Module (TIM) that captures temporal differences over adjacent frames along horizontal and vertical directions, and an Information Supplement Module (ISM) to fuse these two streams. This block is designed to be plug-and-play into existing 2D CNNs, effectively turning DeepFake detection into an inconsistency learning process.
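The core temporal-difference idea behind TIM can be illustrated in a few lines. The sketch below is not the authors' implementation (which operates on learned feature maps inside a 2D CNN); it is a minimal NumPy illustration of the underlying operation: difference adjacent frames, then summarize those differences along the horizontal and vertical directions. The function name and shapes are hypothetical.

```python
import numpy as np

def temporal_inconsistency_features(clip):
    """Hypothetical sketch of a TIM-style temporal-difference feature.

    clip: float array of shape (T, C, H, W) -- a stack of T frames.
    Returns direction-aware summaries of frame-to-frame change:
      horiz: shape (T-1, C, H), differences averaged over the width axis
      vert:  shape (T-1, C, W), differences averaged over the height axis
    """
    # Temporal difference between each pair of adjacent frames.
    diff = clip[1:] - clip[:-1]      # (T-1, C, H, W)
    # Collapse one spatial axis at a time to obtain horizontal- and
    # vertical-direction inconsistency profiles.
    horiz = diff.mean(axis=3)        # average over width  -> (T-1, C, H)
    vert = diff.mean(axis=2)         # average over height -> (T-1, C, W)
    return horiz, vert

# Usage: a perfectly static clip has zero temporal inconsistency,
# while temporal flicker (common in forged videos) does not.
rng = np.random.default_rng(0)
static_clip = np.repeat(rng.random((1, 3, 8, 8)), 4, axis=0)  # (4, 3, 8, 8)
h, v = temporal_inconsistency_features(static_clip)
assert np.allclose(h, 0.0) and np.allclose(v, 0.0)
```

In the paper, analogous difference features are computed on intermediate CNN activations and fused with the SIM stream by the ISM, rather than on raw pixels as shown here.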
Datasets
FaceForensics++ (FF++), Celeb-DF, Deepfake Detection Challenge (DFDC), WildDeepfake
Model(s)
ResNet50 (as backbone for STIL block)
Author countries
China