Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

Authors: Andrii Yermakov, Jan Cech, Jiri Matas

Published: 2025-03-25 14:10:54+00:00

AI Summary

This paper proposes a generalizable deepfake detection method using CLIP's ViT-L/14 visual encoder, achieving competitive accuracy with minimal model modifications. It leverages parameter-efficient fine-tuning and regularization techniques to enhance robustness across diverse datasets and forgery methods.

Abstract

This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
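A minimal sketch of the LN-tuning setup described in the abstract, assuming the Hugging Face transformers CLIP vision encoder; the binary classification head, optimizer settings, and placeholder batch are illustrative and not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Load CLIP's ViT-L/14 visual encoder (pre-trained weights).
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# LN-tuning: freeze all parameters, then unfreeze only LayerNorm parameters.
for param in encoder.parameters():
    param.requires_grad = False
for module in encoder.modules():
    if isinstance(module, nn.LayerNorm):
        for param in module.parameters():
            param.requires_grad = True

# Illustrative binary real/fake head on top of the pooled CLIP embedding.
head = nn.Linear(encoder.config.hidden_size, 2)

trainable = [p for p in encoder.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Forward pass: pooled CLIP embedding -> real/fake logits.
pixel_values = torch.randn(4, 3, 224, 224)  # placeholder batch of preprocessed face crops
features = encoder(pixel_values=pixel_values).pooler_output
logits = head(features)
```

Because only the LayerNorm affine parameters (and the small head) are updated, the vast majority of CLIP's pre-trained weights stay intact, which is the mechanism the paper credits for reduced overfitting.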


Key findings

The proposed method achieves competitive detection accuracy comparable to or exceeding more complex state-of-the-art techniques across multiple datasets. Ablation studies confirm the effectiveness of parameter-efficient fine-tuning and regularization in improving generalization and mitigating overfitting.

Approach

The method uses CLIP's ViT-L/14 visual encoder, fine-tuned with parameter-efficient techniques such as LN-tuning. A tailored preprocessing pipeline optimizes facial image processing, and regularization strategies such as L2 normalization and metric learning enhance generalization.
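A hedged sketch of the regularization side: embeddings are L2-normalized onto the unit hypersphere, and a simple cosine-similarity metric-learning objective pulls same-class embeddings together and pushes real and fake apart. The specific loss and margin below are illustrative; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def hypersphere_embed(features: torch.Tensor) -> torch.Tensor:
    # L2 normalization: project embeddings onto the unit hypersphere.
    return F.normalize(features, p=2, dim=-1)

def pairwise_metric_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                         margin: float = 0.5) -> torch.Tensor:
    # Cosine similarities between all pairs of normalized embeddings.
    sim = embeddings @ embeddings.T
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    # Pull same-class pairs together; push different-class pairs below the margin.
    pos = (1.0 - sim)[same & ~eye]
    neg = F.relu(sim - margin)[~same]
    return pos.mean() + neg.mean()

# Usage with pooled CLIP features (see the LN-tuning sketch above).
features = torch.randn(8, 1024)     # placeholder pooled ViT-L/14 embeddings
labels = torch.randint(0, 2, (8,))  # 0 = real, 1 = fake
z = hypersphere_embed(features)
loss = pairwise_metric_loss(z, labels)
```
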
Datasets

FaceForensics++, Celeb-DF-v2, DFDC, FFIW, DeepSpeak v1.0, Google's DFD

Model(s)

CLIP's ViT-L/14 visual encoder

Author countries

Czech Republic