Linguistic Profiling of Deepfakes: An Open Database for Next-Generation Deepfake Detection

Authors: Yabin Wang, Zhiwu Huang, Zhiheng Ma, Xiaopeng Hong

Published: 2024-01-04 16:19:52+00:00

AI Summary

This paper introduces DFLIP-3K, a large-scale open database of approximately 300K diverse deepfake samples generated by approximately 3K text-to-image models, coupled with around 190K linguistic footprints (prompts). It establishes a benchmark for linguistic profiling of deepfakes, comprising deepfake detection, model identification, and prompt prediction, with the aim of fostering more explainable and trustworthy deepfake analysis.

Abstract

The emergence of text-to-image generative models has revolutionized the field of deepfakes, enabling the creation of realistic and convincing visual content directly from textual descriptions. However, this advancement presents considerably greater challenges in detecting the authenticity of such content. Existing deepfake detection datasets and methods often fall short in effectively capturing the extensive range of emerging deepfakes and offering satisfactory explanatory information for detection. To address this significant issue, this paper introduces a deepfake database (DFLIP-3K) for the development of convincing and explainable deepfake detection. It encompasses about 300K diverse deepfake samples from approximately 3K generative models, the largest number of deepfake models in the literature. Moreover, it collects around 190K linguistic footprints of these deepfakes. These two distinguishing features enable DFLIP-3K to support a benchmark that promotes progress in linguistic profiling of deepfakes, which includes three sub-tasks, namely deepfake detection, model identification, and prompt prediction. The deepfake model and prompt are two essential components of each deepfake, and thus dissecting them linguistically allows for an invaluable exploration of trustworthy and interpretable evidence in deepfake detection, which we believe is the key to next-generation deepfake detection. Furthermore, DFLIP-3K is envisioned as an open database that fosters transparency and encourages collaborative efforts to further enhance its growth. Our extensive experiments on the developed benchmark verify that the DFLIP-3K database is capable of serving as a standardized resource for evaluating and comparing linguistic-based deepfake detection, identification, and prompt prediction techniques.


Key findings

Vision-language models, particularly Flamingo, significantly outperform traditional vision-based models in deepfake detection and model identification, especially for out-of-distribution deepfakes. Flamingo also provides more accurate and detailed prompt predictions, leading to reconstructed images that are visually closer to the originals. The DFLIP-3K dataset highlights that modern text-to-image deepfakes are increasingly photorealistic and challenging to detect using traditional methods.

Approach

The authors curate DFLIP-3K, a database of deepfake images annotated with their source generative models and associated textual prompts. They use it to establish a benchmark with three sub-tasks (deepfake detection, model identification, and prompt prediction) and address these tasks with fine-tuned vision-language models such as Flamingo.
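
To make the three sub-tasks concrete, the following is a minimal sketch of how one benchmark record and per-task scoring might look. It is not the authors' code: the `DeepfakeRecord` layout, its field names, and the file and model names are hypothetical, and `token_f1` is a simple token-overlap stand-in for whatever prompt-similarity metric the paper actually reports.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DeepfakeRecord:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str
    is_fake: bool                # label for deepfake detection
    model_name: Optional[str]    # label for model identification (None if real)
    prompt: Optional[str]        # label for prompt prediction (None if unknown)


def accuracy(predicted: Sequence, expected: Sequence) -> float:
    """Fraction of exact matches; usable for detection and identification."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between two prompts (assumed stand-in metric)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both prompts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy data: one generated image and one real image.
records = [
    DeepfakeRecord("img_0.png", True, "model_a", "a red cat on a sofa"),
    DeepfakeRecord("img_1.png", False, None, None),
]

# Sub-task 1: deepfake detection over all records.
detection_acc = accuracy([True, True], [r.is_fake for r in records])

# Sub-tasks 2 and 3 only apply to records that are actually generated.
fakes = [r for r in records if r.is_fake]
identification_acc = accuracy(["model_a"], [r.model_name for r in fakes])
prompt_score = token_f1("a red dog on a sofa", fakes[0].prompt)
```

Keeping the three labels on a single record reflects the framing above, where each deepfake is described jointly by its authenticity, its source model, and its prompt; a real evaluation would replace the hard-coded predictions with model outputs.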

Datasets

DFLIP-3K, LAION-5B (for real images)

Model(s)

ResNet-50, ViT-base-16, CLIP, BLIP, and Flamingo (specifically OpenFlamingo-9B, which uses a CLIP ViT-Large vision encoder and a LLaMA-7B language model). These models serve as baselines and as the basis of the proposed approach for deepfake detection, model identification, and prompt prediction.

Author countries

P. R. China, United Kingdom