Deep Convolutional Pooling Transformer for Deepfake Detection

Authors: Tianyi Wang, Harry Cheng, Kam Pui Chow, Liqiang Nie

Published: 2022-09-12 15:05:41+00:00

AI Summary

This paper proposes a deep convolutional pooling Transformer for deepfake detection, integrating convolutional pooling and re-attention to capture both local and global image features. The model leverages keyframes, which retain complete image information, for improved performance and robustness.

Abstract

Recently, Deepfake has drawn considerable public attention due to security and privacy concerns in social media digital forensics. As the widely spreading Deepfake videos on the Internet become more realistic, traditional detection techniques have failed to distinguish real from fake. Most existing deep learning methods mainly focus on local features and relations within the face image, using convolutional neural networks as a backbone. However, local features and relations are insufficient for model training to learn enough general information for Deepfake detection. Therefore, existing Deepfake detection methods have hit a bottleneck in further improving detection performance. To address this issue, we propose a deep convolutional Transformer to incorporate the decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ the barely discussed image keyframes in model training for performance improvement and visualize the feature quantity gap between the key and normal image frames caused by video compression. We finally illustrate the transferability with extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.


Key findings
The proposed model consistently outperforms state-of-the-art baselines on both within-dataset and cross-dataset evaluations. The use of keyframes significantly improves performance, and the ablation studies demonstrate the effectiveness of the model's components and the optimal model depth.
Approach
The authors address the limitations of existing methods by proposing a deep convolutional Transformer that combines convolutional pooling and re-attention mechanisms to learn both local and global features from video keyframes. Keyframes are used because, unlike inter-coded frames, they retain complete image information and so avoid the feature loss introduced by video compression.
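To make the re-attention idea concrete, here is a minimal NumPy sketch. Re-attention (introduced in DeepViT, which this work builds on) mixes the per-head attention maps with a learnable head-mixing matrix before applying them to the values, so deeper layers do not collapse to near-identical attention maps. The function name, shapes, and the fixed `theta` below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Re-attention: mix per-head attention maps with a head-mixing
    matrix `theta` (H x H) before applying them to the values.

    q, k, v: (H, N, d) per-head queries/keys/values
    theta:   (H, H) mixing matrix (learnable in the real model)
    Returns: (H, N, d) per-head outputs.
    """
    H, N, d = q.shape
    # Standard scaled dot-product attention maps, one per head: (H, N, N)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    # Mix maps across heads: mixed[h] = sum_g theta[h, g] * attn[g]
    mixed = np.einsum('hg,gnm->hnm', theta, attn)
    # (A normalization layer would follow here in the full model.)
    return mixed @ v

# Tiny example: 2 heads, 4 tokens, 8-dim heads
rng = np.random.default_rng(0)
H, N, d = 2, 4, 8
q, k, v = (rng.standard_normal((H, N, d)) for _ in range(3))
theta = np.eye(H)  # identity mixing recovers plain multi-head attention
out = re_attention(q, k, v, theta)
print(out.shape)  # (2, 4, 8)
```

With `theta` set to the identity, the function reduces to ordinary multi-head self-attention; the learnable off-diagonal entries are what let heads exchange information.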
Datasets
FaceForensics++ (FF++), Deepfake Detection Challenge (DFDC), Celeb-DF, DeeperForensics-1.0 (DF-1.0)
Model(s)
Deep Convolutional Pooling Transformer (with CNN backbone, depth-wise separable convolutions, and a multi-head self-attention mechanism with re-attention)
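The depth-wise separable convolutions mentioned above factor a standard convolution into a per-channel spatial (depthwise) step followed by a 1x1 channel-mixing (pointwise) step, which cuts parameters and computation. The following NumPy sketch shows the idea with valid padding; the function name and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise separable convolution (valid padding, stride 1).

    x:          (C, H, W) input feature map
    dw_kernels: (C, kH, kW) one spatial kernel per channel (depthwise step)
    pw_weights: (C_out, C) 1x1 kernels mixing channels (pointwise step)
    Returns:    (C_out, H-kH+1, W-kW+1)
    """
    C, H, W = x.shape
    _, kH, kW = dw_kernels.shape
    oH, oW = H - kH + 1, W - kW + 1
    # Depthwise: each channel convolved with its own kernel, independently
    dw = np.empty((C, oH, oW))
    for c in range(C):
        for i in range(oH):
            for j in range(oW):
                dw[c, i, j] = np.sum(x[c, i:i + kH, j:j + kW] * dw_kernels[c])
    # Pointwise: 1x1 convolution = linear mix of channels at each position
    return np.einsum('oc,chw->ohw', pw_weights, dw)

x = np.random.default_rng(1).standard_normal((3, 8, 8))
dw_k = np.ones((3, 3, 3)) / 9.0                          # per-channel box blur
pw_w = np.random.default_rng(2).standard_normal((5, 3))  # mix 3 channels -> 5
y = depthwise_separable_conv(x, dw_k, pw_w)
print(y.shape)  # (5, 6, 6)
```

A full CxC_out kxk convolution would need C * C_out * kH * kW weights; the separable form needs only C * kH * kW + C_out * C, which is why it is a common efficiency choice for CNN backbones.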
Author countries
Hong Kong, China