Self-supervised Transformer for Deepfake Detection

Authors: Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Nenghai Yu

Published: 2022-03-02 17:44:40+00:00

AI Summary

This paper proposes a self-supervised transformer-based audio-visual contrastive learning method for deepfake detection. The method learns mouth motion representations by aligning paired video and audio features while separating unpaired ones, achieving generalization comparable to or better than supervised pre-training without requiring labeled lip-reading data.

Abstract

The fast evolution and widespread deployment of deepfake techniques in real-world scenarios require stronger generalization abilities from face forgery detectors. Some works capture features that are unrelated to method-specific artifacts, such as clues of blending boundaries and accumulated up-sampling, to strengthen generalization ability. However, the effectiveness of these methods can be easily corrupted by post-processing operations such as compression. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection. For example, lip movement has been shown to be a robust, well-transferring, high-level semantic feature, which can be learned from the lipreading task. However, the existing method pre-trains the lip feature extraction model in a supervised manner, which requires considerable human effort for data annotation and increases the difficulty of obtaining training data. In this paper, we propose a self-supervised transformer-based audio-visual contrastive learning method. The proposed method learns mouth motion representations by encouraging paired video and audio representations to be close while pushing unpaired ones apart. After pre-training with our method, the model is then partially fine-tuned for the deepfake detection task. Extensive experiments show that our self-supervised method performs comparably to or even better than its supervised pre-training counterpart.


Key findings
The self-supervised method achieves comparable or better performance than supervised pre-training methods. The approach shows improved robustness to common corruptions and better generalization to unseen datasets. Larger pre-training datasets improve model performance.
Approach
The authors use a two-stage approach. First, they pre-train a model using contrastive learning on paired and unpaired audio-visual data to learn robust representations of lip movements. Then, they fine-tune a portion of this pre-trained model for deepfake detection.
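The summary does not give the exact loss or hyperparameters, so the following is only a minimal sketch of the first stage: a symmetric InfoNCE-style contrastive objective that pulls paired video/audio clip embeddings together and pushes unpaired ones apart. The function name and temperature value are hypothetical.

```python
# Minimal sketch of an audio-visual contrastive objective (InfoNCE-style).
# The exact loss, temperature, and projection heads are assumptions; only
# the idea of aligning paired video/audio embeddings and separating
# unpaired ones comes from the paper summary.
import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb: torch.Tensor,
                        audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """video_emb, audio_emb: (batch, dim) clip embeddings.
    Row i of each tensor comes from the same talking-face clip."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs sit on the diagonal; treat each row/column as a classification.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```

In the second stage described above, the audio branch would be discarded and the pre-trained video encoder partially fine-tuned with a binary real/fake classification head.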
Datasets
UNKNOWN
Model(s)
Transformer-based architecture with a 3D convolutional layer for video feature extraction, a 2D ResNet module, and a 1D transformer for temporal modeling. A pre-trained wav2vec2 model is used for audio feature extraction.
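As a rough illustration of the video branch described above, the sketch below chains a 3D convolutional stem, a per-frame 2D ResNet, and a transformer encoder for temporal modeling. All layer sizes, kernel shapes, and the choice of ResNet-18 are assumptions for illustration, not the authors' exact configuration; audio features would come from a separately loaded pre-trained wav2vec2 model.

```python
# Hypothetical sketch of the video (mouth-motion) encoder: 3D conv stem,
# frame-wise 2D ResNet, and a transformer over the temporal axis.
# Sizes are illustrative; requires a recent torchvision for `weights=None`.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VideoMouthEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512, num_layers: int = 4):
        super().__init__()
        # 3D conv stem mixes neighboring frames before per-frame 2D features.
        self.stem = nn.Conv3d(3, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        backbone = resnet18(weights=None)
        # Accept the 64-channel stem output and keep 512-d per-frame features.
        backbone.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1,
                                   padding=1, bias=False)
        backbone.fc = nn.Identity()
        self.frame_encoder = backbone
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, time, height, width) mouth crops
        x = self.stem(clips)                          # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.frame_encoder(x).view(b, t, -1)      # (B, T, 512) frame features
        x = self.temporal(x)                          # temporal modeling
        return x.mean(dim=1)                          # clip-level embedding
```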
Author countries
China