CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection

Authors: Sohail Ahmed Khan, Duc-Tien Dang-Nguyen

Published: 2024-02-20 11:26:42+00:00

AI Summary

This paper proposes adapting pre-trained vision-language models (VLMs), specifically CLIP, for universal deepfake detection. By retaining CLIP's text component and employing Prompt Tuning, the approach outperforms the previous state of the art by 5.01% mAP and 6.61% accuracy while using less than one third of the training data (200k vs. 720k images).

Abstract

The recent advancements in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have significantly streamlined the production of highly realistic and widely accessible synthetic content. As a result, there is a pressing need for effective general-purpose detection mechanisms to mitigate the potential risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) in order to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning-based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while utilizing less than one third of the training data (200k images as compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, Diffusion-based, and Commercial tools.


Key findings
Prompt Tuning significantly outperforms previous state-of-the-art methods in deepfake detection, achieving a 5.01% increase in mAP and a 6.61% improvement in accuracy. The model demonstrates robustness even with reduced training data and various post-processing operations, showing good generalization across different deepfake generation methods.
Approach
The authors adapt the pre-trained CLIP model for deepfake detection using Prompt Tuning, a lightweight adaptation technique that leverages both the visual and textual encoders. Rather than fine-tuning the backbone, it optimizes a small set of learnable prompt embeddings while keeping the CLIP model itself frozen, which improves performance and generalization.
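As a rough illustration of this idea, below is a minimal, CoOp-style prompt-tuning sketch around a frozen CLIP backbone, assuming the OpenAI `clip` package. The class prompts ("real photo" / "synthetic image"), context length, learning rate, and training loop are illustrative placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import clip  # assumed dependency: OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model.float()                       # keep weights in fp32 for a simple training loop
for p in model.parameters():
    p.requires_grad_(False)         # the entire CLIP backbone stays frozen

n_ctx = 16                                        # number of learnable context tokens (illustrative)
class_names = ["real photo", "synthetic image"]   # hypothetical prompt classes
embed_dim = model.token_embedding.embedding_dim

# Learnable context vectors shared across classes; these are the only trained parameters.
ctx = nn.Parameter(torch.empty(n_ctx, embed_dim, device=device).normal_(std=0.02))

# Tokenize prompts with n_ctx placeholder tokens in front of each class name.
placeholder = " ".join(["X"] * n_ctx)
tokenized = clip.tokenize([f"{placeholder} {c}." for c in class_names]).to(device)
with torch.no_grad():
    embedded = model.token_embedding(tokenized)   # [n_cls, 77, embed_dim]
prefix = embedded[:, :1, :]                       # SOS token embedding
suffix = embedded[:, 1 + n_ctx:, :]               # class-name tokens, EOT, padding

def encode_prompts():
    """Run the frozen CLIP text encoder on [SOS][ctx][class name][EOT] prompts."""
    x = torch.cat([prefix, ctx.unsqueeze(0).expand(len(class_names), -1, -1), suffix], dim=1)
    x = x + model.positional_embedding
    x = x.permute(1, 0, 2)                        # NLD -> LND (CLIP transformer convention)
    x = model.transformer(x)
    x = x.permute(1, 0, 2)                        # LND -> NLD
    x = model.ln_final(x)
    eot = tokenized.argmax(dim=-1)                # EOT token has the largest token id
    x = x[torch.arange(x.shape[0]), eot] @ model.text_projection
    return x / x.norm(dim=-1, keepdim=True)

optimizer = torch.optim.AdamW([ctx], lr=2e-3)     # only the context vectors are updated

def training_step(images, labels):
    """One optimization step on a batch of preprocessed images with 0/1 real-fake labels."""
    with torch.no_grad():
        img = model.encode_image(images.to(device))
        img = img / img.norm(dim=-1, keepdim=True)
    txt = encode_prompts()
    logits = model.logit_scale.exp() * img @ txt.t()
    loss = nn.functional.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the context vectors receive gradients, a single training dataset such as ProGAN suffices to adapt the model, while the frozen image and text encoders preserve CLIP's general-purpose representations.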
Datasets
ProGAN for training; 21 distinct test datasets covering images from GAN-based, Diffusion-based, and Commercial generation tools.
Model(s)
CLIP (Contrastive Language-Image Pre-training), adapted via Linear Probing, Fine-tuning, an Adapter Network, and Prompt Tuning; a Linear Probing sketch for contrast follows below.
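For contrast with the prompt-tuning sketch above, a minimal Linear Probing baseline (one of the image-only variants listed here) could look like the following: a single linear layer trained on frozen CLIP image features, with the text encoder unused. The class count, layer size, and optimizer settings are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import clip  # assumed dependency: OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model.float()
for p in model.parameters():
    p.requires_grad_(False)         # frozen backbone; only the probe is trained

# Binary real/fake classifier on top of frozen image embeddings; the text encoder is never used.
probe = nn.Linear(model.visual.output_dim, 2).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def training_step(images, labels):
    """One optimization step of the linear probe on preprocessed images."""
    with torch.no_grad():
        feats = model.encode_image(images.to(device))
    logits = probe(feats)
    loss = nn.functional.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```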
Author countries
Norway