UniForensics: Face Forgery Detection via General Facial Representation

Authors: Ziyuan Fang, Hanqing Zhao, Tianyi Wei, Wenbo Zhou, Ming Wan, Zhanyi Wang, Weiming Zhang, Nenghai Yu

Published: 2024-07-26 20:51:54+00:00

AI Summary

UniForensics is a deepfake detection framework using high-level semantic facial features to identify temporal inconsistencies in videos. It leverages a transformer-based video classification network initialized with a meta-functional face encoder and a two-stage training approach including self-supervised contrastive learning.

Abstract

Previous deepfake detection methods mostly depend on low-level textural features that are vulnerable to perturbations and fall short of detecting unseen forgery methods. In contrast, high-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, and thus generalize more strongly. Motivated by this, we propose a detection method that utilizes high-level semantic features of faces to identify inconsistencies in the temporal domain. We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video classification network, initialized with a meta-functional face encoder for enriched facial representation. In this way, we can take advantage of both the powerful spatio-temporal model and the high-level semantic information of faces. Furthermore, to leverage easily accessible real face data and guide the model in focusing on spatio-temporal features, we design a Dynamic Video Self-Blending (DVSB) method to efficiently generate training samples with diverse spatio-temporal forgery traces from real facial videos. Based on this, we advance our framework with a two-stage training approach: the first stage employs a novel self-supervised contrastive learning scheme, in which we encourage the network to focus on forgery traces by impelling videos generated by the same forgery process to have similar representations. Building on the representation learned in the first stage, the second stage fine-tunes on a face forgery detection dataset to build a deepfake detector. Extensive experiments validate that UniForensics outperforms existing face forgery detection methods in generalization ability and robustness. In particular, our method achieves 95.3% and 77.2% cross-dataset AUC on the challenging Celeb-DFv2 and DFDC, respectively.
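The first-stage objective described above — pulling together clips produced by the same forgery (blending) process — can be sketched as a supervised-contrastive-style loss over clip embeddings. The paper's exact loss formulation is not reproduced here; this is a minimal NumPy sketch in which `process_ids` (an assumed, illustrative name) labels which blending process generated each clip, and clips sharing a process id serve as positives.

```python
import numpy as np

def supcon_loss(features, process_ids, temperature=0.1):
    """SupCon-style contrastive loss (sketch, not the paper's exact loss):
    clip embeddings generated by the same forgery process (same process_id)
    are treated as positives; all other clips in the batch are negatives.

    features:    (N, D) clip embeddings (will be L2-normalized here)
    process_ids: (N,)   integer id of the blending process per clip
    """
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature      # (N, N) scaled cosine sims
    N = sim.shape[0]
    self_mask = np.eye(N, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)        # exclude self-comparisons

    # row-wise log-softmax (numerically stable)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))

    # positives: same process id, excluding the anchor itself
    pos_mask = (process_ids[:, None] == process_ids[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    loss_terms = np.where(pos_mask, log_prob, 0.0)
    per_anchor = -loss_terms.sum(axis=1) / np.maximum(pos_counts, 1)
    return per_anchor[pos_counts > 0].mean()       # average over valid anchors
```

When embeddings of same-process clips are well clustered and separated from other processes, the loss approaches zero; for unstructured embeddings it sits near `log(N - 1)`.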


Key findings
UniForensics outperforms existing methods in generalization and robustness, achieving 95.3% and 77.2% cross-dataset AUC on Celeb-DFv2 and DFDC, respectively. The two-stage training and dynamic video self-blending significantly improve performance.
Approach
UniForensics uses a transformer-based video classification network initialized with a pre-trained face encoder to capture high-level semantic features. A two-stage training process is employed: self-supervised contrastive learning on dynamically self-blended real videos, followed by fine-tuning on a deepfake detection dataset.
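The self-blending idea used to generate training samples can be illustrated at the single-frame level: a jittered copy of the frame is blended back through a soft face mask, creating a blending boundary without any real forgery. The sketch below is a hypothetical simplification — the paper's actual DVSB augmentations and mask generation are not detailed here, and `color_jitter` stands in for whatever transforms the pipeline applies to the donor copy; "dynamic" in DVSB refers to the blending configuration varying across frames of a video.

```python
import numpy as np

def self_blend_frame(frame, mask, color_jitter=0.05, rng=None):
    """Illustrative single-frame self-blending step (not the paper's exact DVSB).

    frame: (H, W, 3) float image in [0, 1]
    mask:  (H, W)    soft face mask in [0, 1]; 1 = fully replaced by the donor
    """
    rng = rng or np.random.default_rng()
    # donor = the same frame with a small per-channel color shift, standing in
    # for the richer appearance transforms a real pipeline would apply
    shift = rng.uniform(-color_jitter, color_jitter, size=3)
    donor = np.clip(frame + shift, 0.0, 1.0)
    # alpha-blend the donor back through the soft mask -> blending artifacts
    alpha = mask[..., None]
    return alpha * donor + (1.0 - alpha) * frame
```

Applied per frame with a mask and jitter that drift over time, this yields pseudo-fake videos whose forgery traces are spatio-temporal rather than purely spatial, which is what the contrastive stage is meant to exploit.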
Datasets
VoxCeleb2 (for self-supervised pre-training), FaceForensics++ (for supervised fine-tuning), Celeb-DFv2, DFDC, FaceShifter (for cross-dataset evaluation)
Model(s)
Transformer-based video classification network (UniFormerV2) initialized with a pre-trained ViT model (FaRL)
Author countries
China