Reduced Spatial Dependency for More General Video-level Deepfake Detection

Authors: Beilin Chu, Xuan Xu, Yufei Zhang, Weike You, Linna Zhou

Published: 2025-03-05 08:51:55+00:00

AI Summary

This paper proposes Spatial Dependency Reduction (SDR) for video deepfake detection, aiming to improve generalization by reducing the model's reliance on spatial information and focusing on temporal consistency cues. SDR integrates features from spatially-perturbed video clusters with a novel Task-Relevant Feature Integration (TRFI) module and then applies a temporal transformer to capture long-range dependencies.
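To make the data flow concrete, here is a minimal PyTorch sketch of how the described pipeline might be wired together. The class and argument names (SDRPipeline, branches, trfi, temporal_transformer) are placeholders rather than the authors' code; the summary only specifies the overall structure.

```python
import torch
import torch.nn as nn

class SDRPipeline(nn.Module):
    """Hypothetical sketch of the SDR flow: several spatially perturbed
    views of a clip -> per-branch features -> integrated common feature
    -> temporal transformer -> real/fake logits."""

    def __init__(self, branches, trfi, temporal_transformer, num_classes=2):
        super().__init__()
        self.branches = nn.ModuleList(branches)   # Spatial Perturbation Branches
        self.trfi = trfi                          # Task-Relevant Feature Integration
        self.temporal = temporal_transformer      # long-range temporal modeling
        self.head = nn.LazyLinear(num_classes)    # classification head (assumed)

    def forward(self, clip):                      # clip: (B, C, T, H, W)
        feats = [b(clip) for b in self.branches]  # one feature per perturbed view
        fused = self.trfi(feats)                  # common temporal feature (B, T, D)
        temporal = self.temporal(fused)           # (B, T, D)
        return self.head(temporal.mean(dim=1))    # pool over time, classify
```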

Abstract

As a prominent form of AI-generated content, Deepfake has raised significant safety concerns. Although it has been demonstrated that temporal consistency cues offer better generalization capability, existing CNN-based methods inevitably introduce spatial bias, which hinders the extraction of intrinsic temporal features. To address this issue, we propose a novel method called Spatial Dependency Reduction (SDR), which integrates common temporal consistency features from multiple spatially-perturbed clusters to reduce the model's dependency on spatial information. Specifically, we design multiple Spatial Perturbation Branches (SPBs) to construct spatially-perturbed feature clusters. Subsequently, drawing on mutual information theory, we propose a Task-Relevant Feature Integration (TRFI) module to capture temporal features residing in a similar latent space across these clusters. Finally, the integrated feature is fed into a temporal transformer to capture long-range dependencies. Extensive benchmarks and ablation studies demonstrate the effectiveness and rationale of our approach.
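The abstract leaves the concrete spatial perturbations unspecified. Below is a hedged illustration of the idea, assuming perturbations such as additive noise, blockwise shuffling, or blurring; the key point is that the same perturbation is applied identically to every frame, disrupting spatial cues while leaving temporal consistency intact.

```python
import torch

def spatial_perturb(clip, mode):
    """Apply one spatial perturbation uniformly across all frames of a clip.

    clip: (C, T, H, W) float tensor. The concrete perturbations here
    (additive noise, blockwise shuffle, blur) are illustrative guesses;
    the paper does not fix them in this summary."""
    if mode == "noise":
        return clip + 0.1 * torch.randn_like(clip)
    if mode == "shuffle":
        # shuffle a 4x4 grid of spatial blocks with the same permutation
        # for every frame (assumes H and W divisible by 4)
        C, T, H, W = clip.shape
        g = 4
        blocks = clip.reshape(C, T, g, H // g, g, W // g)
        blocks = blocks.permute(0, 1, 2, 4, 3, 5).reshape(C, T, g * g, H // g, W // g)
        blocks = blocks[:, :, torch.randperm(g * g)]
        blocks = blocks.reshape(C, T, g, g, H // g, W // g).permute(0, 1, 2, 4, 3, 5)
        return blocks.reshape(C, T, H, W)
    if mode == "blur":
        # cheap box blur via average pooling with stride 1
        frames = clip.permute(1, 0, 2, 3)  # (T, C, H, W)
        frames = torch.nn.functional.avg_pool2d(frames, kernel_size=3,
                                                stride=1, padding=1)
        return frames.permute(1, 0, 2, 3)
    raise ValueError(mode)
```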


Key findings
The proposed SDR method outperforms existing methods in cross-dataset generalization tests on the Celeb-DF-v2 and DFDC datasets after training on FaceForensics++. Ablation studies confirm the effectiveness of each component, particularly the TRFI module, in improving the extraction of generalized temporal consistency features.
Approach
The approach feeds multiple spatially perturbed versions of a video into separate branches. A Task-Relevant Feature Integration (TRFI) module, leveraging mutual information theory and contrastive learning, extracts the temporal consistency features common to these branches; a plausible instantiation is sketched below. A temporal transformer then processes the integrated features for final classification.
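The summary invokes mutual information theory together with contrastive learning; one standard instantiation is an InfoNCE objective, which lower-bounds mutual information. The sketch below assumes that formulation: features of the same clip under different spatial perturbations are treated as positives, and other clips in the batch as negatives.

```python
import torch
import torch.nn.functional as F

def trfi_contrastive_loss(branch_feats, temperature=0.1):
    """Hypothetical InfoNCE-style objective for TRFI.

    branch_feats: list of (B, D) tensors, one per Spatial Perturbation
    Branch. Matching batch indices across branches are positives;
    mismatched indices are negatives."""
    z = [F.normalize(f, dim=1) for f in branch_feats]
    loss, pairs = 0.0, 0
    for i in range(len(z)):
        for j in range(len(z)):
            if i == j:
                continue
            logits = z[i] @ z[j].T / temperature   # (B, B) similarities
            labels = torch.arange(z[i].size(0), device=logits.device)
            loss = loss + F.cross_entropy(logits, labels)
            pairs += 1
    return loss / pairs
```

The integration step itself could then be as simple as averaging the aligned branch features before the temporal transformer; the summary does not pin this detail down.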
Datasets
FaceForensics++, Celeb-DF-v2, DFDC
Model(s)
3D convolutional backbone (modified R50), temporal transformer, RetinaFace (for face detection and cropping)
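For reference, here is a shape-compatible approximation of the backbone-plus-transformer stack. It uses torchvision's r3d_18 as a stand-in for the paper's modified 3D R50, which is not available off the shelf, so treat it as a sketch under that substitution.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoDeepfakeModel(nn.Module):
    """Rough stand-in for the paper's backbone + temporal transformer;
    r3d_18 replaces the modified 3D R50 used by the authors."""

    def __init__(self, d_model=512, nhead=8, num_layers=2, num_classes=2):
        super().__init__()
        backbone = r3d_18(weights=None)
        # keep everything up to global pooling so the time axis survives
        self.stem = nn.Sequential(
            backbone.stem, backbone.layer1, backbone.layer2,
            backbone.layer3, backbone.layer4,
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, clip):                     # (B, 3, T, H, W)
        f = self.stem(clip)                      # (B, 512, T', H', W')
        f = f.mean(dim=[3, 4]).transpose(1, 2)   # spatial pool -> (B, T', 512)
        f = self.temporal(f)                     # long-range temporal attention
        return self.head(f.mean(dim=1))          # pool over time, classify
```

The expected input would be a RetinaFace-cropped face clip, e.g. a (B, 3, 16, 112, 112) tensor.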
Author countries
China