Knowledge-Guided Prompt Learning for Deepfake Facial Image Detection

Authors: Hao Wang, Cheng Deng, Zhidong Zhao

Published: 2025-01-01 02:18:18+00:00

AI Summary

This paper proposes a knowledge-guided prompt learning method for deepfake facial image detection. It retrieves forgery-related prompts from large language models as expert knowledge to guide the optimization of learnable prompts, and applies test-time prompt tuning to mitigate domain shift. Extensive experiments show significant gains over state-of-the-art methods on the DeepFakeFaceForensics dataset.

Abstract

Recent generative models demonstrate impressive performance in synthesizing photographic images, making them hard for humans to distinguish from pristine ones, especially realistic-looking synthetic facial images. Previous works mostly focus on mining discriminative artifacts from vast amounts of visual data. However, they usually lack exploration of prior knowledge and rarely pay attention to the domain shift between training categories (e.g., natural and indoor objects) and testing ones (e.g., fine-grained human facial images), resulting in unsatisfactory detection performance. To address these issues, we propose a novel knowledge-guided prompt learning method for deepfake facial image detection. Specifically, we retrieve forgery-related prompts from large language models as expert knowledge to guide the optimization of learnable prompts. Besides, we design test-time prompt tuning to alleviate the domain shift, achieving significant performance improvements and facilitating application in real-world scenarios. Extensive experiments on the DeepFakeFaceForensics dataset show that our proposed approach notably outperforms state-of-the-art methods.


Key findings
The proposed method significantly outperforms state-of-the-art methods on the DeepFakeFaceForensics dataset, achieving a notable improvement in AUC. Ablation studies confirm the effectiveness of both knowledge-guided prompt learning and test-time prompt tuning, and the method is robust to hyperparameter changes.
Approach
The approach builds on a pre-trained vision-language model (CLIP). First, it retrieves forgery-related prompts from a large language model and uses them as expert knowledge to guide the optimization of learnable prompts. Then, it performs test-time prompt tuning, using pseudo-labels generated by the model itself, to alleviate the domain shift between training and testing data. Both stages are sketched below.
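A minimal sketch of the first stage, assuming OpenAI's clip package and a CoOp-style setup. For brevity, it optimizes one learnable class embedding per class directly in CLIP's joint space instead of feeding learnable context tokens through the text encoder, and the expert prompts below are illustrative placeholders, not the paper's actual LLM-retrieved set:

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model.eval()  # the CLIP backbone stays frozen; only the prompt embeddings train

# Hypothetical forgery-related descriptions standing in for the LLM-retrieved
# expert prompts (the paper's actual prompt set is not reproduced here).
expert_prompts = {
    0: ["a photo of a pristine human face with natural skin texture"],
    1: ["a synthesized human face with blending artifacts and inconsistent lighting"],
}
with torch.no_grad():
    expert_feats = {
        c: F.normalize(model.encode_text(clip.tokenize(p).to(device)).float(),
                       dim=-1).mean(dim=0)
        for c, p in expert_prompts.items()
    }

# One learnable embedding per class (0 = real, 1 = fake); 768 is the joint
# embedding width of CLIP ViT-L/14.
class_embed = torch.nn.Parameter(torch.randn(2, 768, device=device) * 0.02)
opt = torch.optim.Adam([class_embed], lr=1e-3)

def training_step(images, labels, alpha=0.5):
    """One supervised step on a preprocessed image batch already on `device`.
    `alpha` weights the knowledge-guidance term (an assumed regularizer,
    not necessarily the paper's exact loss)."""
    with torch.no_grad():
        img_feats = F.normalize(model.encode_image(images).float(), dim=-1)
    text_feats = F.normalize(class_embed, dim=-1)
    logits = 100.0 * img_feats @ text_feats.t()   # CLIP-style scaled cosine logits
    ce = F.cross_entropy(logits, labels)          # real-vs-fake detection loss
    # Knowledge guidance: pull each learnable embedding toward the LLM-derived
    # expert features of its class.
    kg = sum(1.0 - F.cosine_similarity(text_feats[c], expert_feats[c], dim=0)
             for c in (0, 1))
    loss = ce + alpha * kg
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Continuing the same sketch, the second stage adapts the learnable prompts on unlabeled test images, treating the model's own confident predictions as pseudo-labels; the confidence threshold and step count here are assumptions, not values from the paper:

```python
def test_time_tuning(images, threshold=0.9, steps=1):
    """Adapt class_embed on an unlabeled test batch via pseudo-labels."""
    with torch.no_grad():
        img_feats = F.normalize(model.encode_image(images).float(), dim=-1)
    for _ in range(steps):
        logits = 100.0 * img_feats @ F.normalize(class_embed, dim=-1).t()
        with torch.no_grad():
            probs = logits.softmax(dim=-1)
            conf, pseudo = probs.max(dim=-1)
            keep = conf > threshold               # trust only confident samples
        if keep.any():
            loss = F.cross_entropy(logits[keep], pseudo[keep])
            opt.zero_grad()
            loss.backward()
            opt.step()
```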
Datasets
LSUN (for training), DeepFakeFaceForensics (for testing)
Model(s)
CLIP (ViT-L/14 variant)
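For reference, a hypothetical end-to-end scoring call with the ViT-L/14 backbone, reusing `model`, `preprocess`, and the tuned `class_embed` from the sketches above; the file path is a placeholder:

```python
from PIL import Image

# Score a single face image with the tuned prompts (0 = real, 1 = fake).
image = preprocess(Image.open("face.png")).unsqueeze(0).to(device)
with torch.no_grad():
    feat = F.normalize(model.encode_image(image).float(), dim=-1)
    probs = (100.0 * feat @ F.normalize(class_embed, dim=-1).t()).softmax(dim=-1)
print(f"P(fake) = {probs[0, 1].item():.3f}")
```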
Author countries
China