Robust Sequential DeepFake Detection

Authors: Rui Shao, Tianxing Wu, Ziwei Liu

Published: 2023-09-26 15:01:43+00:00

Comment: IJCV 2025. Extension of our ECCV 2022 paper: arXiv:2207.02204. Code: https://github.com/rshaojimmy/SeqDeepFake

AI Summary

This paper introduces a novel problem, "Detecting Sequential DeepFake Manipulation (Seq-DeepFake)", which aims to predict a sequence of facial manipulation operations on a given face image rather than a simple binary fake/real label. To support this, the authors construct the first Seq-DeepFake dataset with sequential manipulation annotations, including a perturbed version (Seq-DeepFake-P) to mimic real-world scenarios. They propose two Transformer-based models, SeqFakeFormer and SeqFakeFormer++, for robust image-to-sequence detection of these manipulations.

Abstract

Since photorealistic faces can now be readily generated by facial manipulation technologies, the potential for malicious abuse of these technologies has drawn great concern. Numerous deepfake detection methods have thus been proposed. However, existing methods focus only on detecting one-step facial manipulation. With the emergence of easily accessible facial editing applications, people can manipulate facial components through multi-step operations in a sequential manner. This new threat requires detecting a sequence of facial manipulations, which is vital both for detecting deepfake media and for recovering original faces afterwards. Motivated by this observation, we emphasize the need for, and propose, a novel research problem: Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task, which demands only a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of the sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations to the original Seq-DeepFake dataset, constructing the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlations between images and sequences when facing Seq-DeepFake-P, we devise a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++), which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection.


Key findings

The proposed SeqFakeFormer and SeqFakeFormer++ models significantly outperform baselines and state-of-the-art deepfake detection methods on both the clean Seq-DeepFake and the challenging perturbed Seq-DeepFake-P datasets. Detecting sequential manipulations with adaptive lengths proves to be much harder than with fixed lengths. The deeper image-sequence reasoning modules in SeqFakeFormer++ are crucial for maintaining robustness against various post-processing perturbations, and the detected sequences are shown to be highly useful for downstream tasks like face recovery.
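The gap between fixed- and adaptive-length evaluation can be made concrete with a small sketch. The routine below is a hypothetical illustration, not the authors' released evaluation code: it assumes short sequences are padded to a fixed length with a `no-manipulation` token, so a fixed-length score rewards correct padding positions, while an adaptive-length score is computed only over the union of predicted and ground-truth manipulation steps and therefore also penalizes getting the sequence length wrong.

```python
# Hypothetical sketch of fixed- vs adaptive-length sequence scoring
# (the padding scheme and metric definitions are assumptions, not the
# authors' exact Fixed-Acc / Adaptive-Acc implementations).

NO_OP = "no-manipulation"  # assumed padding token
MAX_LEN = 5                # Seq-DeepFake sequences have at most 5 steps


def pad(seq):
    """Pad a manipulation sequence to MAX_LEN with NO_OP tokens."""
    return seq + [NO_OP] * (MAX_LEN - len(seq))


def fixed_acc(pred, gt):
    """Per-position accuracy over padded, fixed-length sequences."""
    p, g = pad(pred), pad(gt)
    return sum(a == b for a, b in zip(p, g)) / MAX_LEN


def adaptive_acc(pred, gt):
    """Per-position accuracy over only the manipulated positions, so the
    model must also predict the sequence length correctly."""
    n = max(len(pred), len(gt), 1)
    p, g = pad(pred), pad(gt)
    return sum(p[i] == g[i] for i in range(n)) / n


gt = ["eyebrow", "lip"]
print(fixed_acc(["eyebrow", "lip"], gt))           # 1.0: exact match
print(fixed_acc(["eyebrow", "nose", "lip"], gt))   # 0.6: padding still matches
print(adaptive_acc(["eyebrow", "nose", "lip"], gt))  # ~0.33: length error hurts
```

Note how a wrong-length prediction keeps a respectable fixed-length score purely from matching padding tokens, while the adaptive score drops sharply, which is consistent with the finding that adaptive-length detection is the harder setting.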

Approach

The authors propose SeqFakeFormer, a Transformer-based model that uses a CNN for spatial feature extraction and an Image Encoder to capture spatial manipulation traces via self-attention. A Sequence Decoder with Spatially Enhanced Cross-Attention (SECA) then models sequential relations in an auto-regressive manner to detect manipulation sequences. For improved robustness against real-world perturbations, SeqFakeFormer++ is introduced, which further integrates Image-Sequence Contrastive Learning (ISC) and Image-Sequence Matching (ISM) to establish deeper cross-modal correlations between images and sequences.
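The auto-regressive decoding step can be sketched schematically. The loop below is a minimal, hypothetical interface (not the released SeqFakeFormer API): starting from a start token, the decoder repeatedly conditions on the image features and the previously predicted operations until it emits an end token or reaches the maximum sequence length, so manipulation sequences of adaptive length fall out naturally.

```python
# Hypothetical greedy auto-regressive decoding loop for predicting a
# manipulation sequence token by token (function and token names are
# assumptions for illustration, not the SeqFakeFormer codebase).

SOS, EOS = "<sos>", "<eos>"
MAX_STEPS = 5  # Seq-DeepFake annotations contain at most 5 manipulations


def decode_sequence(image_features, step_fn, max_steps=MAX_STEPS):
    """Greedily decode a manipulation sequence.

    step_fn(image_features, prefix) stands in for one decoder pass
    (self-attention over the token prefix plus spatially enhanced
    cross-attention over the image features) and returns the next token.
    """
    prefix = [SOS]
    for _ in range(max_steps):
        token = step_fn(image_features, prefix)
        if token == EOS:  # decoder signals the sequence is complete
            break
        prefix.append(token)
    return prefix[1:]  # drop the start token


# Toy step function that replays a fixed answer, for illustration only.
def toy_step(features, prefix):
    answer = ["eyebrow", "hair", EOS]
    return answer[len(prefix) - 1]


print(decode_sequence(None, toy_step))  # ['eyebrow', 'hair']
```

In this framing, an unmanipulated (real) face simply corresponds to the decoder emitting the end token immediately, which is how an image-to-sequence model subsumes the usual binary real/fake decision.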

Datasets

Seq-DeepFake dataset (created by authors, based on CelebA-HQ, CelebAMask-HQ, FFHQ, StyleMapGAN, and Jiang et al.'s [14] facial editing method), Seq-DeepFake-P dataset (Seq-DeepFake with various perturbations applied).

Model(s)

SeqFakeFormer, SeqFakeFormer++, ResNet-18, ResNet-34, ResNet-50 (pre-trained on ImageNet) as backbones.

Author countries

China, Singapore