DeCLIP: Decoding CLIP representations for deepfake localization

Authors: Stefan Smeu, Elisabeta Oneata, Dan Oneata

Published: 2024-09-12 17:59:08+00:00

AI Summary

DeCLIP leverages pretrained CLIP representations for deepfake localization, achieving better generalization than existing methods. It uses a convolutional decoder to upsample CLIP feature maps into pixel-level localization maps, and remains accurate even in the challenging case of images generated by latent diffusion models.

Abstract

Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike previous work, our approach is able to perform localization on the challenging case of latent diffusion models, where the entire image is affected by the fingerprint of the generator. Moreover, we observe that this type of data, which combines local semantic information with a global fingerprint, provides more stable generalization than other categories of generative methods.


Key findings

DeCLIP significantly improves generalization in out-of-domain scenarios compared to baselines. Larger convolutional decoders enhance localization accuracy. Training on latent diffusion model data boosts generalization performance across various manipulation types.

Approach

DeCLIP extracts dense features from the image encoder of a pretrained CLIP model. These features are fed into a convolutional decoder that upsamples the low-resolution feature maps into a high-resolution localization map indicating the manipulated regions.

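As a rough illustration of this pipeline (not the authors' implementation), the PyTorch sketch below extracts dense patch features from a CLIP ViT-L/14 backbone via Hugging Face's CLIPVisionModel and decodes them into a per-pixel manipulation map. The decoder width, depth, and final resize are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPVisionModel


class ConvDecoder(nn.Module):
    """Upsamples low-resolution CLIP feature maps into a pixel-level mask.

    Width and depth are illustrative assumptions, not the paper's exact config.
    """

    def __init__(self, in_channels=1024, width=256, num_upsamples=4):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_upsamples):  # 16x16 -> 256x256 over four doublings
            layers += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(channels, width, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            channels = width
        layers.append(nn.Conv2d(channels, 1, kernel_size=1))  # per-pixel logit
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()
decoder = ConvDecoder()

images = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    tokens = backbone(pixel_values=images).last_hidden_state  # (B, 257, 1024)

patches = tokens[:, 1:, :]            # drop the CLS token, keep patch tokens
b, n, c = patches.shape
side = int(n ** 0.5)                  # 16 for ViT-L/14 at 224x224 input
feats = patches.transpose(1, 2).reshape(b, c, side, side)

logits = decoder(feats)               # (B, 1, 256, 256)
mask = torch.sigmoid(F.interpolate(logits, size=images.shape[-2:], mode="bilinear"))
# `mask` is a (B, 1, 224, 224) map of per-pixel manipulation probabilities.
```

In practice, images would be preprocessed with CLIP's own normalization and the decoder trained against ground-truth manipulation masks, e.g. with a per-pixel binary cross-entropy loss.
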
Datasets

Dolos dataset (faces with locally manipulated attributes), MS COCO dataset (general images with inpainting), AutoSplice dataset (images with objects replaced)

Model(s)

CLIP (ViT-L/14 and ResNet-50 variants) with a convolutional decoder; Patch Forensics, PSCC-Net, and CAT-Net used as baselines

Author countries

Romania