DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection

Authors: Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang

Published: 2024-10-31 06:26:00+00:00

AI Summary

This paper proposes DIP, a transformer-based framework for deepfake video detection that leverages directional inconsistencies in motion information. DIP uses a spatiotemporal encoder, a directional inconsistency decoder with direction-aware attention and inconsistency diffusion, and a spatiotemporal invariant loss to achieve state-of-the-art performance.

Abstract

With the advancement of deepfake generation techniques, the importance of deepfake detection in protecting multimedia content integrity has become increasingly obvious. Recently, temporal inconsistency clues have been explored to improve the generalizability of deepfake video detection. According to our observation, the temporal artifacts of forged videos in terms of motion information usually exhibit quite distinct inconsistency patterns along horizontal and vertical directions, which could be leveraged to improve the generalizability of detectors. In this paper, a transformer-based framework for Diffusion Learning of Inconsistency Pattern (DIP) is proposed, which exploits directional inconsistencies for deepfake video detection. Specifically, DIP begins with a spatiotemporal encoder to represent spatiotemporal information. A directional inconsistency decoder is adopted accordingly, where direction-aware attention and inconsistency diffusion are incorporated to explore potential inconsistency patterns and jointly learn the inherent relationships. In addition, the SpatioTemporal Invariant Loss (STI Loss) is introduced to contrast spatiotemporally augmented sample pairs and prevent the model from overfitting nonessential forgery artifacts. Extensive experiments on several public datasets demonstrate that our method could effectively identify directional forgery clues and achieve state-of-the-art performance.
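The abstract describes the STI Loss as contrasting embeddings of a clip and its spatiotemporally augmented counterpart. The paper's exact formulation is not reproduced here; the following is a minimal sketch of such a contrastive consistency objective, with all names, the InfoNCE form, and the temperature value being illustrative assumptions.

```python
# Hedged sketch of a spatiotemporal-invariance contrastive loss (not the authors' code).
import torch
import torch.nn.functional as F

def sti_loss_sketch(z_orig: torch.Tensor, z_aug: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style consistency loss between original and augmented clip embeddings.

    z_orig, z_aug: (batch, dim) embeddings of the same clips before and after
    spatiotemporal augmentation; matching rows are treated as positive pairs.
    """
    z_orig = F.normalize(z_orig, dim=-1)
    z_aug = F.normalize(z_aug, dim=-1)
    logits = z_orig @ z_aug.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    # Each clip should match its own augmented view; all other pairs act as negatives,
    # discouraging reliance on augmentation-sensitive (nonessential) artifacts.
    return F.cross_entropy(logits, targets)
```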


Key findings
DIP effectively identifies directional forgery clues and achieves state-of-the-art performance on several public datasets. The method shows superior generalizability in cross-dataset and cross-manipulation evaluations, as well as robustness to various video distortions. Ablation studies confirm the effectiveness of the proposed modules.
Approach
The approach uses a transformer-based architecture with a spatiotemporal encoder to extract features. A directional inconsistency decoder analyzes horizontal and vertical motion inconsistencies, incorporating direction-aware attention and inconsistency diffusion. A SpatioTemporal Invariant Loss (STI Loss) is used to improve generalization.
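To make the directional decoding idea concrete, below is a minimal sketch of direction-aware attention over a spatiotemporal feature map. The class name, tensor layout, fusion layer, and default sizes are assumptions for illustration only; the key point is that attention is applied separately along the horizontal (width) and vertical (height) axes so inconsistencies in each direction are modeled independently before being fused.

```python
# Hedged sketch of direction-aware attention (illustrative, not the authors' implementation).
import torch
import torch.nn as nn

class DirectionAwareAttentionSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.h_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # horizontal direction
        self.v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # vertical direction
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) spatiotemporal features from the encoder.
        B, T, H, W, C = x.shape
        # Horizontal: attend across the width axis for every (frame, row).
        h_in = x.reshape(B * T * H, W, C)
        h_out, _ = self.h_attn(h_in, h_in, h_in)
        h_out = h_out.reshape(B, T, H, W, C)
        # Vertical: attend across the height axis for every (frame, column).
        v_in = x.permute(0, 1, 3, 2, 4).reshape(B * T * W, H, C)
        v_out, _ = self.v_attn(v_in, v_in, v_in)
        v_out = v_out.reshape(B, T, W, H, C).permute(0, 1, 3, 2, 4)
        # Fuse the two directional views back into a single representation.
        return self.fuse(torch.cat([h_out, v_out], dim=-1))
```

Applied to per-frame motion features, this kind of axis-wise attention keeps horizontal and vertical inconsistency patterns separable, which is the property the directional inconsistency decoder exploits.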
Datasets
FaceForensics++, Celeb-DF-v2, WildDeepFake, Deepfake Detection Challenge (DFDC-P and DFDC), DeeperForensics-1.0
Model(s)
Transformer-based framework (DIP) with spatiotemporal encoder and directional inconsistency decoder
Author countries
China