Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Authors: Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

Published: 2024-07-29 18:00:10+00:00

AI Summary

This paper introduces CoDE, a contrastive learning-based embedding space for deepfake detection that leverages both global and local image features. CoDE achieves state-of-the-art accuracy on a newly created dataset (D3) containing 9.2 million images from four diffusion models and demonstrates strong generalization to unseen generators.

Abstract

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.


Key findings

CoDE surpasses existing methods on the D3 dataset, showing significant accuracy improvements, particularly under image transformations. It also generalizes well to unseen image generators, achieving state-of-the-art performance on a challenging extended test set and on additional external datasets.
Approach

CoDE uses contrastive learning to train a Vision Transformer from scratch, incorporating both global and local image features. The training objective enforces separation between real and fake images in the embedding space while maintaining robustness to transformations via a multi-scale contrastive loss.
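The training objective described above can be illustrated with a minimal NumPy sketch: a supervised-contrastive term that pulls same-class (real or fake) embeddings together and pushes the classes apart, plus a global-local term aligning crop embeddings with their image's global embedding. The function names and exact loss forms here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def contrastive_loss(emb, labels, temperature=0.1):
    # Supervised-contrastive sketch: embeddings sharing a label
    # (real vs. fake) are treated as positives, all others as negatives.
    sim = cosine_sim(emb, emb) / temperature
    n = len(emb)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        total += -np.mean([sim[i, j] - log_denom for j in positives])
    return total / n

def global_local_loss(global_embs, local_embs_per_image):
    # Hypothetical global-local term: each local-crop embedding should
    # align (cosine similarity near 1) with its image's global embedding.
    sims = [cosine_sim(g[None], locs).mean()
            for g, locs in zip(global_embs, local_embs_per_image)]
    return 1.0 - float(np.mean(sims))
```

In this sketch the two terms would simply be summed (possibly with a weighting factor) to form the training loss; the actual multi-scale formulation in the paper may differ.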
Datasets

D3 dataset (9.2 million images from four diffusion models, paired with real images from LAION-400M); an extended test set with images from 12 diffusion models; additional external datasets including images from LDM, GLIDE, DALL-E, DALL-E 2, DALL-E 3, Midjourney, ProGAN, CycleGAN, BigGAN, StyleGAN, GauGAN, and StarGAN.
Model(s)

Vision Transformer (ViT-Tiny) trained from scratch
Author countries

Italy