AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Authors: Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray

Published: 2025-12-19 16:06:03+00:00

Comment: Under Review

AI Summary

This paper tackles the challenge of generalizable deepfake detection by adapting large vision-language models (VLMs) such as CLIP. It introduces AdaptPrompt, a parameter-efficient framework that combines visual adapters and textual prompt tuning with an architectural pruning step (dropping the final block of the vision encoder) to capture subtle generative artifacts. The study also proposes Diff-Gen, a novel diffusion-generated dataset, and achieves state-of-the-art generalization across diverse synthetic content.

Abstract

Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
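
The layer-ablation result (pruning the final transformer block of the vision encoder) can be approximated by reading features from the penultimate layer of CLIP's ViT. Below is a minimal sketch, assuming a ViT-L/14 checkpoint and the penultimate-layer CLS token as the image feature; neither choice comes from the paper, and this is not the authors' code.

```python
# Sketch only: drop the final vision-transformer block by taking the
# penultimate hidden state. Checkpoint and feature choice are assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_ID = "openai/clip-vit-large-patch14"   # assumed CLIP variant
model = CLIPVisionModel.from_pretrained(MODEL_ID).eval()
processor = CLIPImageProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def penultimate_features(image: Image.Image) -> torch.Tensor:
    """Image feature taken before the last transformer block."""
    inputs = processor(images=image, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] is the final block's output; [-2] skips that block,
    # keeping the higher-frequency patch statistics the paper argues it
    # attenuates.
    tokens = out.hidden_states[-2]        # (1, 1 + num_patches, width)
    return tokens[:, 0]                   # CLS token as the image embedding
```

In the full framework these features would then pass through the trainable components described under Approach below.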


Key findings
AdaptPrompt achieved state-of-the-art generalization across 25 diverse deepfake datasets, outperforming baselines on unseen diffusion and commercial models while maintaining performance on GANs. Architectural pruning of CLIP's vision encoder significantly boosted accuracy by retaining high-frequency generative artifacts. The framework also demonstrated strong robustness to post-processing, effective few-shot learning, and high accuracy in source attribution.
Approach
The authors address generalizable deepfake detection by leveraging pre-trained CLIP. They introduce AdaptPrompt, a parameter-efficient framework that combines lightweight visual adapters and learnable textual prompts for fine-tuning CLIP while keeping its backbone frozen. A key insight is pruning the final transformer block of CLIP's vision encoder, which enhances the retention of high-frequency generative artifacts for improved detection.
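A compact sketch of that recipe is given below: a trainable bottleneck adapter sits on top of frozen CLIP image features, and images are scored by cosine similarity against text features for a real/fake prompt pair. The prompt wording, adapter width, and the use of fixed prompts in place of the paper's learned textual context are illustrative assumptions, not the published implementation.

```python
# Minimal sketch (not the authors' implementation) of the parameter-efficient
# setup: a trainable residual bottleneck adapter on frozen CLIP image
# features, classified against text features of real/fake prompts.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

class VisualAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
for p in clip.parameters():              # the CLIP backbone stays frozen
    p.requires_grad_(False)

adapter = VisualAdapter(clip.config.projection_dim).to(device)  # trainable

# Fixed prompts stand in for AdaptPrompt's learned textual context.
prompts = ["a photo of a real scene", "an image generated by an AI model"]
with torch.no_grad():
    tok = processor(text=prompts, return_tensors="pt", padding=True).to(device)
    text_feats = clip.get_text_features(**tok)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def logits_for(pixel_values: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits over {real, fake} for a batch of images."""
    with torch.no_grad():                # frozen visual features
        img_feats = clip.get_image_features(pixel_values=pixel_values)
    img_feats = adapter(img_feats)       # only this path carries gradients
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    return clip.logit_scale.exp() * img_feats @ text_feats.t()
```

Training would then optimize only the adapter (and, in the full method, the learnable prompt vectors) with a cross-entropy loss over these two logits, leaving the CLIP backbone untouched.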
Datasets
Diff-Gen, ProGAN, LSUN, ImageNet, COCO, CelebA, LAION, YouTube, BigGAN, CycleGAN, EG3D, GauGAN, StarGAN, StyleGAN, StyleGAN-2, StyleGAN-3, Taming-T, Glide, Guided, LDM, Stable Diffusion, SDXL, MidJourney-V5, Adobe Firefly, DALL-E 3, DALL-E (mini), Deepfakes (FF++), FaceSwap (FF++).
Model(s)
CLIP (frozen backbone, adapted with lightweight visual adapters and learnable textual prompts; final vision-transformer block pruned)
Author countries
Canada, UAE, Norway