Time Step Generating: A Universal Synthesized Deepfake Image Detector

Authors: Ziyue Zeng, Haoyuan Liu, Dingjie Peng, Luoxu Jing, Hiroshi Watanabe

Published: 2024-11-17 09:39:50+00:00

AI Summary

This paper introduces Time Step Generating (TSG), a universal synthetic image detector that uses a pre-trained diffusion model's U-Net as a feature extractor. By controlling the time step of the input, TSG extracts features that distinguish real from synthetic images, achieving significant improvements in accuracy and generalizability over reconstruction-based methods.

Abstract

Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, diffusion models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods distinguish diffusion-model-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model; consequently, when the pre-trained generative model encounters out-of-domain data, detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on a pre-trained model's reconstruction ability, specific datasets, or sampling algorithms. Our method uses a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. These features are then passed to a classifier (e.g., ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark, where it achieves significant improvements in both accuracy and generalizability.
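
In symbols (our notation, inferred from the abstract rather than quoted from the paper), the extracted feature is presumably the frozen U-Net's noise prediction on the clean, un-noised input image at a chosen time step t:

    % F_t : the TSG detection feature (the name F_t is ours)
    % \epsilon_\theta : the frozen, pre-trained diffusion U-Net
    % x : the input image, with no noise added and no reconstruction
    F_t(x) = \epsilon_\theta(x, t)

Controlling t tunes which level of detail the network responds to, which is how TSG isolates the fine-grained differences between real and synthetic images that the abstract describes, without running any inversion or denoising loop.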


Key findings
TSG significantly outperforms previous methods like DIRE and LaRE2 in accuracy, achieving nearly 100% accuracy on several datasets. It is also approximately 10 times faster than DIRE. The method demonstrates robustness against JPEG compression.
Approach
TSG uses a pre-trained diffusion model's U-Net to extract features from images by controlling the time step parameter. These features are then fed into a ResNet-50 classifier to determine if the image is real or synthetic. This approach avoids time-consuming reconstruction processes.
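
A minimal PyTorch sketch of this pipeline, assuming the diffusers UNet2DModel API; the hub id, time-step value, and all names below are illustrative placeholders, not the authors' code:

    import torch
    from diffusers import UNet2DModel
    from torchvision.models import resnet50

    TIME_STEP = 10  # assumed small t; a tunable hyperparameter, not the paper's value

    # Frozen feature extractor: placeholder hub id, not the authors' checkpoint.
    unet = UNet2DModel.from_pretrained("path/to/uncond-imagenet-unet")
    unet.eval().requires_grad_(False)

    @torch.no_grad()
    def tsg_features(x: torch.Tensor, t: int = TIME_STEP) -> torch.Tensor:
        # Feed the *clean* image (no added noise, no reconstruction) into the
        # U-Net at time step t; its noise prediction is the detection feature.
        timesteps = torch.full((x.shape[0],), t, dtype=torch.long, device=x.device)
        return unet(x, timesteps).sample  # same spatial shape as x, 3 channels

    # The 3-channel feature map drops straight into a stock ResNet-50
    # trained for binary real-vs-synthetic classification.
    classifier = resnet50(num_classes=2)

    x = torch.randn(4, 3, 256, 256)       # stand-in batch of images in [-1, 1]
    logits = classifier(tsg_features(x))  # shape: (4, 2)

Because the U-Net stays frozen and no inversion or denoising loop runs, a single forward pass per image replaces the reconstruction step, consistent with the roughly 10x speedup over DIRE reported in the key findings.
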
Datasets
GenImage benchmark (containing subsets from BigGAN, VQDM, SD V1.5, ADM, Wukong, GLIDE, SD V1.4, and Midjourney)
Model(s)
Pre-trained U-Net (from a class-unconditional ImageNet diffusion model) and ResNet-50
Author countries
Japan