Spatiotemporal Inconsistency Learning for DeepFake Video Detection

Authors: Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, Lizhuang Ma

Published: 2021-09-04 13:05:37+00:00

AI Summary

This paper proposes a novel Spatiotemporal Inconsistency Learning (STIL) block for deepfake video detection. The STIL block captures both spatial and temporal inconsistencies in forged videos and integrates the two sources of information within a 2D CNN framework to improve detection accuracy.

Abstract

The rapid development of facial manipulation techniques has aroused public concern in recent years. Following the success of deep learning, existing methods typically formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To address this issue, we formulate this task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it into a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both the horizontal and vertical directions. The ISM simultaneously utilizes the spatial information from SIM and the temporal information from TIM to establish a more comprehensive spatial-temporal representation. Moreover, our STIL block is flexible and can be plugged into existing 2D CNNs. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against state-of-the-art competitors.


Key findings
The proposed STIL block significantly outperforms state-of-the-art methods on four benchmark datasets and generalizes well across datasets. Ablation studies confirm the effectiveness of each module within the STIL block and the importance of capturing both spatial and temporal inconsistencies.
Approach
The authors propose a STIL block consisting of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). The TIM exploits temporal differences between adjacent frames along both the horizontal and vertical directions to capture temporal inconsistencies, while the SIM attends to spatial inconsistencies within each frame. The ISM fuses the outputs of SIM and TIM into a more comprehensive spatial-temporal representation; a sketch of such a block follows.
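The summary does not include reference code, so the following is a minimal PyTorch-style sketch of what such a block could look like. The module names (SIM, TIM, ISM) follow the paper, but every internal choice, the kernel sizes, the reduction ratio, the sigmoid gating, the concatenation-based fusion, and the residual connection, is an illustrative assumption rather than the authors' implementation.

```python
# Illustrative sketch of a STIL-like block; internal layer choices are assumptions.
import torch
import torch.nn as nn


class SIM(nn.Module):
    """Spatial Inconsistency Module: a simple spatial-attention sketch."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B*T, C, H, W)
        return x * self.attn(x)


class TIM(nn.Module):
    """Temporal Inconsistency Module: differences over adjacent frames,
    modeled separately along the horizontal and vertical directions."""
    def __init__(self, channels, num_frames):
        super().__init__()
        self.t = num_frames
        # Direction-specific convolutions (assumed kernel size 3).
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                                padding=(0, 1), groups=channels)
        self.conv_v = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                padding=(1, 0), groups=channels)
        self.gate = nn.Sigmoid()

    def forward(self, x):                      # x: (B*T, C, H, W)
        bt, c, h, w = x.shape
        b = bt // self.t
        feat = x.reshape(b, self.t, c, h, w)
        # Temporal difference between adjacent frames (last difference repeated).
        diff = feat[:, 1:] - feat[:, :-1]
        diff = torch.cat([diff, diff[:, -1:]], dim=1).reshape(bt, c, h, w)
        attn = self.gate(self.conv_h(diff) + self.conv_v(diff))
        return x * attn


class ISM(nn.Module):
    """Information Supplement Module: fuse the spatial and temporal streams."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, s, t):
        return self.fuse(torch.cat([s, t], dim=1))


class STILBlock(nn.Module):
    def __init__(self, channels, num_frames):
        super().__init__()
        self.sim = SIM(channels)
        self.tim = TIM(channels, num_frames)
        self.ism = ISM(channels)

    def forward(self, x):                      # x: (B*T, C, H, W)
        return x + self.ism(self.sim(x), self.tim(x))


if __name__ == "__main__":
    frames = torch.randn(2 * 8, 64, 56, 56)    # 2 clips of 8 frames each
    block = STILBlock(channels=64, num_frames=8)
    print(block(frames).shape)                 # torch.Size([16, 64, 56, 56])
```

Because the block keeps the 2D tensor layout (frames stacked along the batch dimension), it can be dropped into an ordinary image backbone without switching to 3D convolutions.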
Datasets
FaceForensics++ (FF++), Celeb-DF, Deepfake Detection Challenge (DFDC), WildDeepfake
Model(s)
ResNet50 with the proposed STIL block integrated into its bottleneck blocks.
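As an illustration of how such a block might be plugged into a 2D backbone, the sketch below inserts the hypothetical STILBlock from the previous snippet in front of the 3x3 convolution of every torchvision ResNet-50 bottleneck. The insertion point, clip length, and frame-level score averaging are assumptions made for illustration; the paper only states that the STIL block is integrated into the bottleneck blocks of ResNet-50.

```python
# Sketch: plugging the STILBlock above into a torchvision ResNet-50.
# The exact insertion point inside each bottleneck is an assumption.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.resnet import Bottleneck


def build_stil_resnet50(num_frames=8, num_classes=2):
    model = resnet50()
    for module in list(model.modules()):
        if isinstance(module, Bottleneck):
            c = module.conv2.in_channels
            # Prepend the spatiotemporal block to the bottleneck's 3x3 conv.
            module.conv2 = nn.Sequential(STILBlock(c, num_frames), module.conv2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


if __name__ == "__main__":
    clips = torch.randn(2, 8, 3, 224, 224)            # (batch, frames, C, H, W)
    model = build_stil_resnet50(num_frames=8)
    logits = model(clips.flatten(0, 1))               # frames processed as a batch
    video_logits = logits.view(2, 8, -1).mean(dim=1)  # average frame scores per clip
    print(video_logits.shape)                         # torch.Size([2, 2])
```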
Author countries
China