FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

Authors: Gaojian Wang, Feng Lin, Tong Wu, Zhenguang Liu, Zhongjie Ba, Kui Ren

Published: 2024-12-16 17:58:45+00:00

AI Summary

This paper introduces FSFM, a self-supervised pretraining framework for learning robust and transferable facial representations. FSFM leverages masked image modeling and instance discrimination to encode both local and global facial semantics, improving performance on various face security tasks.

Abstract

This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self-supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.


Key findings
FSFM outperforms supervised pretraining and other self-supervised methods on cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection, achieving state-of-the-art results and demonstrating superior generalization across all three tasks.
Approach
FSFM is a self-supervised pretraining framework that combines masked image modeling (MIM) and instance discrimination (ID). For MIM, a novel CRFR-P masking strategy explicitly enforces intra-region consistency and challenging inter-region coherency by masking over facial regions rather than uniformly at random. The ID network is coupled with MIM via tailored self-distillation to establish local-to-global correspondence.
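To make the region-aware masking idea concrete, the sketch below shows one plausible reading of a CRFR-P-style mask: fully cover one randomly chosen facial region (encouraging intra-region consistency), then spread the remaining masking budget over the other regions (forcing inter-region reasoning). The region partition, function name, and exact budgeting rule are illustrative assumptions, not the authors' implementation.

```python
import random

def crfr_p_style_mask(regions, mask_ratio=0.75, rng=None):
    """Hypothetical sketch of a CRFR-P-style facial mask.

    regions: dict mapping a facial-region name (e.g. "eyes") to a list
             of ViT patch indices belonging to that region; the
             partition itself would come from facial parsing/landmarks.
    Returns the set of masked patch indices.
    """
    rng = rng or random.Random()
    total = sum(len(p) for p in regions.values())
    budget = int(total * mask_ratio)

    # 1) Cover one random facial region entirely, so the model must
    #    reconstruct a whole semantic part (intra-region consistency).
    covered = rng.choice(list(regions))
    masked = set(regions[covered])

    # 2) Spend the remaining budget on patches from the other regions,
    #    so reconstruction also needs cross-region context
    #    (inter-region coherency).
    remaining = budget - len(masked)
    pool = [i for name, p in regions.items() if name != covered for i in p]
    rng.shuffle(pool)
    masked.update(pool[:max(remaining, 0)])
    return masked
```

For a 14x14 ViT-B/16 patch grid, `regions` would partition the 196 patch indices among parts such as eyes, nose, mouth, and skin; the sketch only illustrates the masking logic, not the parsing step.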
Datasets
VGGFace2, FaceForensics++, Celeb-DF-v2, Deepfake Detection Challenge (DFDC), DFDC Preview, WildDeepfake, MSU-MFSD, CASIA-FASD, Idiap Replay-Attack, OULU-NPU, DiFF
Model(s)
Vision Transformer (ViT-B/16, ViT-S/16, ViT-L/16), Masked Autoencoder (MAE)
Author countries
China