Detecting and Recovering Sequential DeepFake Manipulation

Authors: Rui Shao, Tianxing Wu, Ziwei Liu

Published: 2022-07-05 17:59:33+00:00

AI Summary

This paper introduces the novel problem of detecting sequential deepfake manipulation (Seq-DeepFake), in which multiple facial manipulations are applied to an image one after another. To address it, the authors construct the first Seq-DeepFake dataset and propose a Seq-DeepFake Transformer (SeqFakeFormer) that predicts the sequence of manipulation operations.

Abstract

Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, the potential malicious abuse of these technologies has drawn great concern, and numerous deepfake detection methods have been proposed. However, existing methods only focus on detecting one-step facial manipulation. With the emergence of easily accessible facial editing applications, people can manipulate facial components through multi-step operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital both for detecting deepfake media and for recovering original faces afterwards. Motivated by this observation, we emphasize the need for, and propose, a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task, which demands only a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence task (e.g., image captioning) and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols and metrics for this new research problem. Extensive experiments demonstrate the effectiveness of SeqFakeFormer. Several valuable observations are also revealed to facilitate future research in broader deepfake detection problems.
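
To make the label format concrete, here is a minimal sketch of how such a sequential manipulation vector might be encoded. The operation vocabulary, integer IDs, and padding scheme below are illustrative assumptions, not the dataset's exact format:

```python
# Illustrative only: operation names, IDs, and padding are assumptions.
OPS = {"NO-MANIP": 0, "nose": 1, "eye": 2, "eyebrow": 3, "lip": 4, "hair": 5}
MAX_STEPS = 5  # pad every annotation to a fixed maximum sequence length

def encode_sequence(ops):
    """Encode an ordered list of manipulation names as a fixed-length vector."""
    ids = [OPS[o] for o in ops]
    return ids + [OPS["NO-MANIP"]] * (MAX_STEPS - len(ids))

# A face manipulated three times, in order: first nose, then eye, then lip.
print(encode_sequence(["nose", "eye", "lip"]))  # [1, 2, 4, 0, 0]
# A pristine face reduces to the all-"no manipulation" vector.
print(encode_sequence([]))                      # [0, 0, 0, 0, 0]
```

Unlike a binary real/fake label, this target preserves both which operations were applied and their order, which is what makes recovery of the original face possible.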


Key findings
SeqFakeFormer outperforms baseline methods and existing state-of-the-art deepfake detection models on the newly created Seq-DeepFake dataset. The model effectively leverages both spatial and sequential information for improved accuracy, particularly in the more challenging scenario of detecting sequences with variable lengths. The detected sequences also enable successful face recovery.
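
The paper defines its own evaluation protocols for this benchmark; as a rough illustration of what sequence-level evaluation involves, the sketch below computes generic per-position and exact-match accuracy over padded label vectors. The metric names and details here are assumptions, not the paper's definitions:

```python
import numpy as np

def per_position_accuracy(pred, target):
    """Fraction of sequence positions (including padding) predicted correctly."""
    pred, target = np.asarray(pred), np.asarray(target)
    return (pred == target).mean()

def exact_match_accuracy(pred, target):
    """1.0 only if the entire manipulation sequence is predicted correctly."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float((pred == target).all())

# Ground truth: nose -> eye -> lip (padded with 0 = no manipulation).
target = [1, 2, 4, 0, 0]
pred   = [1, 2, 5, 0, 0]  # third step predicted wrong
print(per_position_accuracy(pred, target))  # 0.8
print(exact_match_accuracy(pred, target))   # 0.0
```

Variable-length sequences are harder precisely because the model must also decide where the sequence stops, not just which operations occurred.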
Approach
The authors frame Seq-DeepFake detection as an image-to-sequence problem. They propose SeqFakeFormer, which uses a CNN to extract spatial features, a transformer encoder to model spatial relations, and a transformer decoder with spatially enhanced cross-attention to model sequential relations and predict the manipulation sequence.
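
The following is a minimal PyTorch sketch of this image-to-sequence pipeline. Hyperparameters, layer counts, and input sizes are illustrative guesses, and the paper's spatially enhanced cross-attention is approximated here by the standard cross-attention inside nn.TransformerDecoder:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class SeqFakeFormerSketch(nn.Module):
    """A minimal image-to-sequence sketch in the spirit of SeqFakeFormer.
    All hyperparameters are illustrative, not the paper's settings."""

    def __init__(self, vocab_size=6, d_model=256, nhead=8):
        super().__init__()
        backbone = resnet34(weights=None)
        # Keep all layers up to the final 7x7 spatial feature map.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        self.pos = nn.Parameter(torch.randn(49, d_model))  # 7x7 grid positions
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), 2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), 2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_ids):
        # images: (B, 3, 224, 224); tgt_ids: (B, T) shifted-right label tokens.
        feats = self.proj(self.cnn(images))           # (B, d, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)      # (B, 49, d)
        memory = self.encoder(feats + self.pos)       # model spatial relations
        tgt = self.embed(tgt_ids)                     # (B, T, d)
        T = tgt.size(1)                               # causal mask: each step
        mask = torch.triu(                            # sees only earlier steps
            torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)  # sequential relations
        return self.head(out)                           # (B, T, vocab)

model = SeqFakeFormerSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.zeros(2, 5, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 5, 6])
```

At inference time the decoder would run autoregressively, feeding each predicted operation token back in until the maximum number of steps is reached.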
Datasets
A new Seq-DeepFake dataset created by the authors, built from the CelebA-HQ, CelebAMask-HQ, and FFHQ datasets via sequential manipulations of facial components and facial attributes.
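
A hypothetical loader pairing each image with its sequence annotation might look as follows; the directory layout and JSON annotation format are assumptions for illustration only, not the released dataset's actual structure:

```python
import json
from pathlib import Path
from PIL import Image
import torch
from torch.utils.data import Dataset

class SeqDeepFakeDataset(Dataset):
    """Hypothetical loader: pairs each face image with its padded
    manipulation-sequence label (format assumed for illustration)."""

    def __init__(self, root, annotation_file, transform=None):
        self.root = Path(root)
        # Assumed format: {"000001.jpg": [1, 2, 4, 0, 0], ...}
        self.labels = json.loads(Path(annotation_file).read_text())
        self.names = sorted(self.labels)
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(self.root / name).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        target = torch.tensor(self.labels[name], dtype=torch.long)
        return image, target
```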
Model(s)
Seq-DeepFake Transformer (SeqFakeFormer), which combines a CNN backbone (ResNet-34 or ResNet-50), a transformer encoder, and a transformer decoder with a spatially enhanced cross-attention module.
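
To illustrate the shapes involved in the decoder's cross-attention (queries from sequence tokens, keys and values from the spatial feature grid), here is a stand-in using PyTorch's standard nn.MultiheadAttention. The paper's spatially enhanced variant additionally modulates this attention with spatial weights derived from the sequence embeddings, which this sketch does not reproduce:

```python
import torch
import torch.nn as nn

d_model, nhead = 256, 8  # illustrative sizes
cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

seq_tokens = torch.randn(2, 5, d_model)      # decoder queries, one per step
spatial_feats = torch.randn(2, 49, d_model)  # encoder memory (7x7 grid)

# Each manipulation step attends over all 49 spatial locations.
out, attn_weights = cross_attn(seq_tokens, spatial_feats, spatial_feats)
print(out.shape, attn_weights.shape)  # (2, 5, 256) (2, 5, 49)
```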
Author countries
Singapore