Face Forgery Detection with Elaborate Backbone

Authors: Zonghui Guo, Yingjie Liu, Jie Zhang, Haiyong Zheng, Shiguang Shan

Published: 2024-09-25 13:57:16+00:00

AI Summary

This paper addresses the poor generalization of face forgery detection (FFD) models by proposing an elaborate FFD backbone. This is achieved through self-supervised pre-training on real-face datasets, a competitive fine-tuning framework, and a threshold optimization mechanism.

Abstract

Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Due to different face synthesis algorithms with diverse forgery patterns, FFD models often overfit specific patterns in training datasets, resulting in poor generalization to other unseen forgeries. This severe challenge requires FFD models to possess strong capabilities in representing complex facial features and extracting subtle forgery cues. Although previous FFD models directly employ existing backbones to represent and extract facial forgery cues, the critical role of backbones is often overlooked, particularly as their knowledge and capabilities are insufficient to address FFD challenges, inevitably limiting generalization. Therefore, it is essential to integrate the backbone pre-training configurations and seek practical solutions by revisiting the complete FFD workflow, from backbone pre-training and fine-tuning to inference of discriminant results. Specifically, we analyze the crucial contributions of backbones with different configurations in the FFD task and propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone, equipping it with superior facial representation capabilities. We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues within a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that utilizes prediction confidence to improve inference reliability. Comprehensive experiments demonstrate that our FFD model with the elaborate backbone achieves excellent performance in FFD and extra face-related tasks, i.e., presentation attack detection. Code and models are available at https://github.com/zhenglab/FFDBackbone.


Key findings
The proposed method achieves state-of-the-art performance on several benchmark datasets, demonstrating superior generalization and adaptation capabilities. The use of self-supervised pre-training on real faces and the competitive fine-tuning framework are crucial for improved performance. The threshold optimization mechanism significantly enhances the accuracy and reliability of the FFD model.
Approach
The authors propose a three-stage approach: (1) pre-training a Vision Transformer (ViT) backbone using self-supervised learning on a large real-face dataset; (2) fine-tuning the backbone with a competitive learning mechanism using two branches to extract diverse forgery cues; (3) optimizing the classification threshold using prediction confidence to improve inference reliability.
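The third stage replaces the usual fixed 0.5 decision boundary with a threshold selected from held-out predictions. As an illustrative sketch only (not the paper's exact mechanism), a minimal version sweeps candidate thresholds over validation scores and keeps the one maximizing accuracy; the function name and grid size are assumptions for illustration.

```python
import numpy as np

def select_threshold(val_scores, val_labels, grid_size=101):
    """Illustrative threshold search for a binary real/fake classifier.

    val_scores: predicted probabilities of 'fake', in [0, 1]
    val_labels: ground truth, 0 = real, 1 = fake
    Returns the threshold with the highest validation accuracy.
    """
    val_scores = np.asarray(val_scores, dtype=float)
    val_labels = np.asarray(val_labels, dtype=int)
    best_t, best_acc = 0.5, -1.0
    for t in np.linspace(0.0, 1.0, grid_size):
        preds = (val_scores >= t).astype(int)   # classify as fake above t
        acc = float((preds == val_labels).mean())
        if acc > best_acc:                      # keep the best-performing cut
            best_t, best_acc = t, acc
    return best_t, best_acc
```

At inference time, the chosen threshold replaces 0.5 when binarizing the model's confidence scores; the paper's actual mechanism additionally exploits prediction confidence to improve reliability.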
Datasets
CelebA, CelebV-Text, FFHQ (for pre-training); FaceForensics++, Celeb-DF, DFDC, FFIW (for evaluation); 9 additional cross-datasets generated by various GAN-based and Diffusion Model-based methods.
Model(s)
Vision Transformer (ViT), ResNet, Xception, EfficientNet (for comparison); MoCo v3, MAE, BEiT v2 (for self-supervised learning)
Author countries
China