Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model

Authors: Yue-Hua Han, Tai-Ming Huang, Kai-Lung Hua, Jun-Cheng Chen

Published: 2024-04-08 14:58:52+00:00

AI Summary

This paper proposes a novel video-based deepfake detection method using a side-network-based decoder with spatial and temporal modules to adapt a CLIP image encoder. Facial Component Guidance (FCG) enhances spatial learning by focusing on key facial regions, improving generalizability and efficiency.

Abstract

Generative models have enabled the creation of highly realistic facial-synthetic images, raising significant concerns due to their potential for misuse. Despite rapid advancements in the field of deepfake detection, developing efficient approaches to leverage foundation models for improved generalizability to unseen forgery samples remains challenging. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues using the CLIP image encoder for generalized video-based Deepfake detection. Additionally, we introduce Facial Component Guidance (FCG) to enhance spatial learning generalizability by encouraging the model to focus on key facial regions. By leveraging the generic features of a vision-language foundation model, our approach demonstrates promising generalizability on challenging Deepfake datasets while also exhibiting superiority in training data efficiency, parameter efficiency, and model robustness.
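As a rough illustration of how guidance toward key facial regions could be expressed as a training signal, here is a minimal, hypothetical PyTorch sketch: it penalizes spatial attention that drifts away from facial components. The soft component masks, the four example regions, and all function names are illustrative assumptions, not the paper's actual FCG formulation.

```python
import torch
import torch.nn.functional as F

def fcg_loss(patch_tokens: torch.Tensor, component_masks: torch.Tensor) -> torch.Tensor:
    """Hypothetical facial-component guidance loss (sketch only).

    patch_tokens:    (B, N, D) spatial tokens from the side-network decoder.
    component_masks: (B, K, N) soft masks marking K facial regions
                     (e.g. eyes, nose, lips, skin) over the N patches.
    """
    # Use token magnitude as a simple proxy for spatial attention.
    energy = patch_tokens.norm(dim=-1)            # (B, N)
    attn = F.softmax(energy, dim=-1)              # attention distribution over patches

    # Normalize each component mask into a distribution over patches.
    masks = component_masks / component_masks.sum(dim=-1, keepdim=True).clamp(min=1e-6)

    # Cross-entropy pulling the attention distribution toward every
    # facial component, averaged over components and the batch.
    log_attn = attn.unsqueeze(1).clamp(min=1e-8).log()  # (B, 1, N)
    return -(masks * log_attn).sum(dim=-1).mean()
```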


Key findings
The proposed method outperforms state-of-the-art approaches in cross-dataset evaluation while requiring less training data and fewer trainable parameters. It is also robust to common perturbations and generalizes to deepfakes produced by modern diffusion models.
Approach
The approach uses a CLIP image encoder to extract features from video frames. A side-network decoder, whose spatial and temporal modules are guided by Facial Component Guidance (FCG), processes these features for deepfake detection (see the sketch below). The final prediction aggregates the scores of the temporal, spatial, and spatio-temporal classification heads.
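To make the data flow concrete, the following minimal PyTorch sketch wires a frozen CLIP backbone to a lightweight side network with spatial and temporal modules and averages three classification heads. It assumes the backbone is frozen (typical for side-network adaptation), simplifies the spatial module to operate on per-frame CLIP embeddings rather than patch tokens, and the module sizes and equal-weight head averaging are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SideNetworkDetector(nn.Module):
    """Sketch of the described pipeline, not the authors' implementation."""

    def __init__(self, clip_encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.encoder = clip_encoder.eval()          # foundation model, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad = False

        # Lightweight side-network modules (sizes are assumptions).
        self.spatial = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head_s = nn.Linear(dim, 1)             # spatial head
        self.head_t = nn.Linear(dim, 1)             # temporal head
        self.head_st = nn.Linear(dim, 1)            # spatio-temporal head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W); encoder assumed to map images to (N, D) embeddings.
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.encoder(frames.flatten(0, 1))   # (B*T, D) per-frame features
        feats = feats.view(B, T, -1)

        s = self.spatial(feats).mean(dim=1)          # spatial cues, pooled per clip
        t = self.temporal(feats).mean(dim=1)         # temporal cues across frames
        st = s + t                                   # fused spatio-temporal cue

        # Final prediction: aggregate the three classification heads.
        logits = (self.head_s(s) + self.head_t(t) + self.head_st(st)) / 3
        return logits.squeeze(-1)                    # (B,) fake/real logits
```

In this reading, only the side network and heads are trained, which is consistent with the parameter efficiency the paper reports for adapting a foundation model.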
Datasets
FaceForensics++, CelebDF-v2, DeepFake Detection Challenge (DFDC), FaceShifter, DeeperForensics, WildDeepfake, DiffusionForensics (CelebA-HQ subset), HeyGen AI avatar videos
Model(s)
CLIP ViT-L/14 image encoder, a custom side-network decoder with spatial and temporal modules
Author countries
Taiwan