Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection
Authors: Kaiqing Lin, Yuzhen Lin, Weixiang Li, Taiping Yao, Bin Li
Published: 2024-09-04 12:46:30+00:00
AI Summary
This paper proposes RepDFD, a novel deepfake detection method that repurposes a pre-trained Vision-Language Model (VLM) like CLIP without fine-tuning its internal parameters. It achieves this by learning visual perturbations and adaptive text prompts, significantly improving cross-dataset and cross-manipulation performance.
Abstract
The proliferation of deepfake faces poses serious potential harms to daily life. Despite substantial advances in deepfake detection in recent years, the generalizability of existing methods to forgeries from unseen datasets or produced by emerging generative models remains limited. In this paper, inspired by the zero-shot capabilities of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm, which steers a model's predictions through input perturbations alone, our method reprograms a pre-trained VLM (e.g., CLIP) solely by manipulating its input, without tuning any internal parameters. First, learnable visual perturbations are applied to refine feature extraction for deepfake detection. Then, we exploit face-embedding information to create sample-level adaptive text prompts, further improving performance. Extensive experiments on several popular benchmark datasets demonstrate that (1) cross-dataset and cross-manipulation deepfake detection performance is significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to WildDeepfake); and (2) these gains are achieved with fewer trainable parameters, making the approach promising for real-world applications.
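The core idea in the abstract, reprogramming a frozen model by optimizing only an additive input perturbation, can be sketched in a few lines. The following is a minimal, self-contained illustration, not the authors' RepDFD implementation: a fixed random linear map stands in for CLIP's frozen image encoder, two fixed vectors stand in for the "real"/"fake" text-prompt embeddings, and only the perturbation `delta` is trained (here via finite-difference gradient descent, purely for simplicity). All names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen CLIP-like image encoder: a fixed random projection.
# Its weights are never updated -- only the input perturbation is learned.
W_frozen = rng.standard_normal((16, 8))

def encode(x):
    """Frozen encoder: project and L2-normalize (CLIP-style embedding)."""
    v = x @ W_frozen
    return v / np.linalg.norm(v)

# Stand-ins for the two class text-prompt embeddings ("real" / "fake").
t_real = encode(rng.standard_normal(16))
t_fake = encode(rng.standard_normal(16))

def logits(x, delta):
    """Reprogrammed prediction: perturb the INPUT, not the model weights."""
    z = encode(x + delta)
    return np.array([z @ t_real, z @ t_fake])

# Toy objective: push one sample toward the "fake" prompt by learning delta.
x = rng.standard_normal(16)

def loss(d):
    lo = logits(x, d)
    return -(lo[1] - lo[0])  # maximize fake-vs-real margin

# Finite-difference gradient descent on delta alone (illustrative only;
# in practice this would be backprop through the frozen encoder).
delta, lr, eps = np.zeros(16), 0.5, 1e-5
for _ in range(200):
    grad = np.zeros_like(delta)
    for i in range(delta.size):
        e = np.zeros_like(delta)
        e[i] = eps
        grad[i] = (loss(delta + e) - loss(delta - e)) / (2 * eps)
    delta -= lr * grad

print("loss before:", loss(np.zeros(16)), "after:", loss(delta))
```

The design point the sketch illustrates is the parameter count: the trainable state is one input-sized vector (16 values here; an image-sized perturbation in the paper's setting), while the encoder's weights stay untouched, which is why reprogramming needs far fewer trainable parameters than fine-tuning.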