Video Transformer for Deepfake Detection with Incremental Learning

Authors: Sohail A. Khan, Hang Dai

Published: 2021-08-11 16:22:56+00:00

Comment: Accepted at ACM International Conference on Multimedia, October 20 to 24, 2021, Virtual Event, China

AI Summary

This paper introduces a novel video transformer with incremental learning for deepfake video detection. It leverages 3D face reconstruction to generate UV texture maps from input face images, combining these with aligned face images to extract enhanced features. The model employs an incremental learning strategy, enabling state-of-the-art deepfake detection performance and improved generalization across various public datasets.

Abstract

Face forgery by deepfakes is widespread on the internet, raising severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture map from a single input face image. The aligned face image also provides pose, eye-blink, and mouth-movement information that cannot be perceived in the UV texture image, so we use both the face images and their UV texture maps to extract image features. We present an incremental learning strategy to fine-tune the proposed model on a smaller amount of data and achieve better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer with incremental learning achieves state-of-the-art performance on the deepfake video detection task, with enhanced feature learning from the sequenced data.


Key findings
The proposed video transformer achieved state-of-the-art performance across the FaceForensics++, DFD, and DFDC datasets, with the fusion of image- and video-transformer predictions performing best. The incremental learning strategy significantly improved generalization, allowing effective fine-tuning on new data with few samples while maintaining performance on previously seen datasets. The use of UV texture maps and segment embeddings further improved feature learning and detection accuracy.
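The prediction fusion mentioned above can be sketched in a few lines. The paper does not specify the exact fusion rule used here, so simple averaging of the two fake-probabilities is an assumption for illustration, and `fuse` is a hypothetical helper name:

```python
# Illustrative late fusion of image- and video-level deepfake scores.
# Simple probability averaging is an assumed fusion rule, not the
# authors' confirmed method.

def fuse(image_prob, video_prob, threshold=0.5):
    """Average the two fake-probabilities and threshold the result."""
    p = (image_prob + video_prob) / 2.0
    return "fake" if p >= threshold else "real"

print(fuse(0.9, 0.7))
print(fuse(0.1, 0.2))
```

Averaging is the simplest late-fusion scheme; a weighted average or a small learned combiner would follow the same pattern.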
Approach
The authors propose a novel video transformer model that processes both aligned face images and their corresponding UV texture maps, generated via 3D face reconstruction, to enhance feature learning. It incorporates learnable segment embeddings to differentiate input types and utilizes an incremental learning strategy for robust fine-tuning on new deepfake datasets without performance degradation on previous ones.
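The segment-embedding idea described above can be sketched as follows: a learnable per-segment vector is added to every patch token of its input type (aligned face image vs. UV texture map) before the two token sequences are concatenated for the transformer. This is a minimal numpy sketch under assumed dimensions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8    # token embedding dimension (illustrative)
n_patches = 4  # patch tokens per input (illustrative)

# Patch tokens from the two aligned inputs (stand-ins for extracted features).
face_tokens = rng.normal(size=(n_patches, d_model))
uv_tokens = rng.normal(size=(n_patches, d_model))

# One learnable segment embedding per input type (0: face image, 1: UV map).
segment_emb = rng.normal(size=(2, d_model))

# Add each segment embedding to every token of its input type, then
# concatenate the two sequences into one token stream for the transformer.
face_tokens = face_tokens + segment_emb[0]
uv_tokens = uv_tokens + segment_emb[1]
tokens = np.concatenate([face_tokens, uv_tokens], axis=0)

print(tokens.shape)
```

In a trained model the segment embeddings would be learned parameters (analogous to BERT's sentence-type embeddings), letting the transformer distinguish which input modality each token came from.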
Datasets
FaceForensics++, DFDC dataset, DeepFake Detection (DFD) dataset
Model(s)
Video Transformer (modified Vision Transformer base architecture), XceptionNet (for image feature extraction), 3D Dense Face Alignment (3DDFA) model, Single Shot Detector (SSD)
Author countries
United Arab Emirates