Time Step Generating: A Universal Synthesized Deepfake Image Detector

Authors: Ziyue Zeng, Haoyuan Liu, Dingjie Peng, Luoxu Jing, Hiroshi Watanabe

Published: 2024-11-17 09:39:50+00:00

Comment: 9 pages, 7 figures

AI Summary

This paper proposes Time Step Generating (TSG), a universal synthetic image detector designed to distinguish real images from those generated by diffusion models without relying on reconstruction processes or specific generative models. TSG leverages a pre-trained diffusion model's U-Net as a feature extractor, controlling the time step 't' to capture fine-grained differences, which are then classified by a ResNet. This approach significantly improves detection accuracy and generalizability while being considerably faster than prior reconstruction-based methods.

Abstract

Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, diffusion models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods distinguish diffusion-model-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, if the pre-trained generative model encounters out-of-domain data, detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on pre-trained models' reconstruction ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. These features are then passed through a classifier (e.g., ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark, where it achieves significant improvements in both accuracy and generalizability.


Key findings
The TSG method achieved an average accuracy of 94.9% (at t=0) on the GenImage benchmark, outperforming the baseline LaRE2 (75.6%) by nearly 20 percentage points. It is also approximately 10 times faster than the DIRE method at generating feature images. The approach demonstrates strong generalization across generative models and robustness against JPEG compression.
Approach
The Time Step Generating (TSG) method utilizes a pre-trained diffusion model's U-Net as a feature extractor. It captures fine-grained details by feeding the input image into the U-Net at a specific, fixed time step 't', effectively extracting noise prediction information. These extracted features are then passed to a ResNet-50 classifier to determine if the image is real or synthetically generated.
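The pipeline above can be sketched in a few lines. This is a minimal, hedged illustration only: the paper uses a pre-trained class-unconditional ImageNet diffusion U-Net and a ResNet-50 classifier, whereas here a fixed random linear filter stands in for the noise predictor so the sketch runs without any model weights. The function names (`noise_predictor`, `tsg_features`) are hypothetical, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_predictor(x, t):
    """Placeholder for the pre-trained diffusion U-Net's noise prediction.
    A real TSG setup would load the ImageNet diffusion U-Net; here a fixed
    random channel-mixing filter stands in so the sketch is self-contained."""
    w = np.random.default_rng(42).standard_normal((3, 3)) * 0.1
    # Channel mixing plus a simple time-dependent scale (stand-in for
    # the U-Net's sinusoidal time-step conditioning).
    return np.einsum('ij,bjhw->bihw', w, x) * (1.0 + 0.01 * t)

def tsg_features(images, t=0):
    # Core TSG idea: feed the raw image directly into the noise predictor
    # at a fixed, small time step t (no inversion, no denoising loop),
    # and use the predicted noise as the feature image for classification.
    return noise_predictor(images, t)

imgs = rng.standard_normal((4, 3, 32, 32))   # stand-in image batch (B, C, H, W)
feats = tsg_features(imgs, t=0)              # feature images for the classifier
print(feats.shape)                           # (4, 3, 32, 32)
```

The resulting feature images keep the input's shape, so they can be fed to an off-the-shelf image classifier (the paper uses ResNet-50) with no architectural changes. Skipping the inversion/denoising loop is what makes TSG roughly an order of magnitude faster than reconstruction-based detectors such as DIRE.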
Datasets
GenImage benchmark (with images from BigGAN, VQDM, SD V1.5, ADM, Wukong), and custom 'Unbiased datasets' constructed from Glide, SD V1.4, and Midjourney.
Model(s)
U-Net from a pre-trained class-unconditional ImageNet diffusion model (as the feature extractor), ResNet-50 (as the classifier).
Author countries
Japan