Deepfake Detection with Spatio-Temporal Consistency and Attention

Authors: Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian

Published: 2025-02-12 08:51:33+00:00

AI Summary

This paper proposes a neural deepfake detector that uses spatio-temporal consistency and attention mechanisms to identify localized manipulation signatures in videos. The model applies spatial attention to frame-level features and temporal attention to inconsistencies across frame sequences, achieving state-of-the-art performance on two large datasets.

Abstract

Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of interest from researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in the manipulated videos. Moreover, they fail to attend to manipulation-specific subtle and well-localized pattern variations along both spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos at individual frame level as well as frame sequence level. Using a ResNet backbone, it strengthens the shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained to detect forged content as a classifier. We evaluate our method on two popular large datasets and achieve significant performance gains over state-of-the-art methods. Moreover, our technique also provides memory and computational advantages over the competitive techniques.
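
The spatial stream described in the abstract can be pictured concretely. Below is a minimal PyTorch sketch, not the authors' implementation: the module names, channel sizes, 1x1-conv attention, and the convolutional stand-in for the texture-enhancement block are all illustrative assumptions. It shows shallow ResNet features being re-weighted by a spatial attention map, with a texture-enhanced copy of those shallow features fused into the deeper features.

```python
# Minimal sketch of the spatial stream: spatial attention on shallow ResNet
# features, plus fusion of texture-enhanced shallow features with deep ones.
# All module names and channel sizes are illustrative assumptions.
import torch.nn as nn
import torchvision.models as models

class SpatialAttention(nn.Module):
    """1x1-conv attention map that re-weights spatial locations."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, x):
        return x * self.attn(x)

class SpatialStream(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Shallow layers (through layer1) produce 256-channel features.
        self.shallow = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                     resnet.maxpool, resnet.layer1)
        # Deep layers (layer2-layer4) produce 2048-channel features.
        self.deep = nn.Sequential(resnet.layer2, resnet.layer3, resnet.layer4)
        self.spatial_attn = SpatialAttention(256)
        # Stand-in for the paper's Dense-block texture enhancement.
        self.texture = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1),
                                     nn.BatchNorm2d(256), nn.ReLU())
        self.fuse = nn.Conv2d(256, 2048, kernel_size=1)

    def forward(self, x):
        shallow = self.spatial_attn(self.shallow(x))
        deep = self.deep(shallow)
        # Project texture-enhanced shallow features to the deep channel
        # count and pool them to the deep spatial resolution before fusion.
        tex = self.fuse(self.texture(shallow))
        tex = nn.functional.adaptive_avg_pool2d(tex, deep.shape[-2:])
        return deep + tex
```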


Key findings
The proposed model outperforms eight state-of-the-art methods on both FaceForensics++ (LQ) and DFDC datasets in terms of accuracy and AUC. It also demonstrates improved memory and computational efficiency compared to existing methods. The attention mechanisms effectively highlight localized and temporal inconsistencies indicative of deepfakes.
Approach
The authors propose a deepfake detection method that uses a ResNet backbone enhanced with spatial and temporal attention mechanisms. Spatial attention focuses on localized artifacts within individual frames, while temporal attention analyzes inconsistencies between consecutive frames using optical flow. The combined features are then used for classification.
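
As a hedged illustration of the temporal side, the sketch below implements one plausible reading of a "distance attention" over per-frame embeddings: negative pairwise distances between projected frame features act as attention logits, so temporally inconsistent frames stand out in the resulting attention map. The class name and projection layout are assumptions; the paper's exact formulation may differ.

```python
# One plausible reading of temporal "distance attention" over per-frame
# embeddings; the paper's exact formulation is not reproduced here.
import torch
import torch.nn as nn

class TemporalDistanceAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, frames):           # frames: (batch, T, dim)
        q, k, v = self.q(frames), self.k(frames), self.v(frames)
        # Negative pairwise L2 distances serve as attention logits, so
        # mutually consistent frames attend strongly to each other and
        # inconsistent (manipulated) frames receive distinctive weights.
        dist = torch.cdist(q, k)         # (batch, T, T)
        attn = torch.softmax(-dist, dim=-1)
        return attn @ v, attn            # attended features + temporal map
```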
Datasets
FaceForensics++ (LQ) and Deepfake Detection Challenge (DFDC)
Model(s)
ResNet50 with added spatial and temporal attention modules (WS-DAN for spatial attention and a ViT-based distance attention mechanism for temporal attention). A Dense block is used for texture enhancement.
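
To show how these components could compose into a classifier, here is a hypothetical end-to-end wrapper that reuses the SpatialStream and TemporalDistanceAttention sketches above. The pooling and fusion choices are illustrative only, not the authors' design.

```python
# Hypothetical composition of the two sketches above into a binary
# real/fake classifier; assumes SpatialStream and TemporalDistanceAttention
# are in scope. Pooling and fusion choices are illustrative assumptions.
import torch
import torch.nn as nn

class DeepfakeDetector(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.spatial = SpatialStream()                  # first sketch
        self.temporal = TemporalDistanceAttention(dim)  # second sketch
        self.head = nn.Linear(dim, 2)                   # real vs. fake

    def forward(self, clip):                  # clip: (batch, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.spatial(clip.flatten(0, 1))        # (b*t, dim, h, w)
        feats = feats.mean(dim=(-2, -1)).view(b, t, -1) # per-frame vectors
        attended, _ = self.temporal(feats)              # temporal attention
        return self.head(attended.mean(dim=1))          # clip-level logits

# Usage: logits = DeepfakeDetector()(torch.randn(2, 8, 3, 224, 224))
```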
Author countries
Australia