Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

Authors: Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada

Published: 2025-01-02 10:21:34+00:00

AI Summary

This paper proposes FakeSTormer, a multi-task learning framework for deepfake video detection that aims to improve generalization to unseen generation methods. It adds two auxiliary branches that attend to artifact-prone spatial and temporal regions, and it uses a video-level data synthesis strategy to generate high-quality pseudo-fake videos for training.

Abstract

Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained on real and fake image sequences, which hinders their generalization to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and annotations that require no manual labeling for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.
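
To make the idea of a pseudo-fake video with built-in annotations more concrete, here is a minimal sketch in plain Python. It is not the authors' synthesis strategy. It assumes a generic self-blending scheme in which each real frame is blended with a brightness-shifted copy of itself under a mask, and the blending strength varies slightly from frame to frame. The function names, the `gain`, `base_alpha`, and `jitter` parameters, and the choice of labels are all hypothetical.

```python
import random

def blend_frame(frame, mask, gain, alpha):
    """Blend a grayscale frame with a brightness-shifted copy of itself."""
    out = []
    for row, mask_row in zip(frame, mask):
        out_row = []
        for pixel, m in zip(row, mask_row):
            altered = min(1.0, pixel * gain)  # hypothetical altered copy
            w = alpha * m                     # per-pixel blending weight
            out_row.append(w * altered + (1.0 - w) * pixel)
        out.append(out_row)
    return out

def make_pseudo_fake(video, mask, gain=1.2, base_alpha=0.5, jitter=0.1, seed=0):
    """Return (pseudo-fake video, spatial label, temporal label).

    Assumption: the blending mask serves as the spatial label, and the
    frame-to-frame change in blending strength serves as the temporal label.
    """
    rng = random.Random(seed)
    fake, alphas = [], []
    for frame in video:
        alpha = base_alpha + rng.uniform(-jitter, jitter)
        alpha = min(1.0, max(0.0, alpha))
        fake.append(blend_frame(frame, mask, gain, alpha))
        alphas.append(alpha)
    temporal_label = [abs(b - a) for a, b in zip(alphas, alphas[1:])]
    return fake, mask, temporal_label

# Toy input: 3 frames of 4x4 pixels, with a 2x2 blended region in the centre.
video = [[[0.5] * 4 for _ in range(4)] for _ in range(3)]
mask = [[1.0 if 1 <= r <= 2 and 1 <= c <= 2 else 0.0 for c in range(4)]
        for r in range(4)]
fake, spatial_label, temporal_label = make_pseudo_fake(video, mask)
```

Because the labels fall out of the synthesis parameters themselves, no manual annotation is needed, which is the property the abstract highlights for the auxiliary branches.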


Key findings
FakeSTormer outperforms recent state-of-the-art methods on several deepfake detection benchmarks and generalizes better to unseen datasets and manipulation methods. The model is also robust to various unseen perturbations and to different data compression levels.
Approach
FakeSTormer uses a multi-task learning framework with three branches: a classification branch, a temporal branch that attends to temporal inconsistencies, and a spatial branch that attends to spatial artifacts. A video-level data synthesis method (SBV) generates pseudo-fake videos for training; these supply the annotations for the two auxiliary branches and are intended to improve generalization.
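
One common way to train a three-branch model of this kind is to sum one loss per branch. The sketch below illustrates that pattern only. The specific loss functions (binary cross-entropy for classification, mean squared error for the auxiliary branches) and the weights `w_temporal` and `w_spatial` are assumptions made for illustration, not values taken from the paper.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy for a single predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def mean_squared_error(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multi_task_loss(cls_prob, label,
                    temporal_pred, temporal_target,
                    spatial_pred, spatial_target,
                    w_temporal=1.0, w_spatial=1.0):
    """Weighted sum of the classification loss and two auxiliary losses."""
    loss_cls = binary_cross_entropy(cls_prob, label)
    loss_temporal = mean_squared_error(temporal_pred, temporal_target)
    loss_spatial = mean_squared_error(spatial_pred, spatial_target)
    return loss_cls + w_temporal * loss_temporal + w_spatial * loss_spatial

# With perfect auxiliary predictions, only the classification term remains.
loss = multi_task_loss(0.5, 1, [0.1, 0.2], [0.1, 0.2], [0.0, 1.0], [0.0, 1.0])
# A wrong temporal prediction raises the total.
worse = multi_task_loss(0.5, 1, [0.5, 0.2], [0.1, 0.2], [0.0, 1.0], [0.0, 1.0])
```

In a setup like this, the auxiliary terms only shape training. At inference time, the classification branch alone would typically produce the real or fake decision.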
Datasets
FaceForensics++ (FF++), Celeb-DFv2 (CDF), DeepfakeDetection (DFD), Deepfake Detection Challenge Preview (DFDCP), Deepfake Detection Challenge (DFDC), WildDeepfake (DFW), DiffSwap, DF40
Model(s)
Revisited TimeSformer architecture
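
For context, the original TimeSformer uses divided space-time attention: each token first attends across time to tokens at the same patch position, then across space to tokens in the same frame. The sketch below only computes which tokens would be grouped together in each step. It reflects the original TimeSformer grouping, not the authors' revised architecture, and it ignores the classification token. The helper names are hypothetical.

```python
def token_index(frame, patch, patches_per_frame):
    """Flat index of a token when frames are laid out one after another."""
    return frame * patches_per_frame + patch

def temporal_groups(num_frames, patches_per_frame):
    """Groups of tokens that share a patch position across all frames."""
    return [[token_index(t, n, patches_per_frame) for t in range(num_frames)]
            for n in range(patches_per_frame)]

def spatial_groups(num_frames, patches_per_frame):
    """Groups of tokens that belong to the same frame."""
    return [[token_index(t, n, patches_per_frame)
             for n in range(patches_per_frame)]
            for t in range(num_frames)]

# Toy example: 2 frames with 3 patches each, so 6 tokens in total.
print(temporal_groups(2, 3))  # [[0, 3], [1, 4], [2, 5]]
print(spatial_groups(2, 3))   # [[0, 1, 2], [3, 4, 5]]
```

With T frames and N patches per frame, each token is compared with roughly T + N others under this scheme, instead of T * N under joint space-time attention.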
Author countries
Luxembourg, Tunisia