Is Artificial Intelligence Generated Image Detection a Solved Problem?

Authors: Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, Zhangjie Fu

Published: 2025-05-18 10:00:39+00:00

AI Summary

This paper introduces AIGIBench, a benchmark for evaluating AI-generated image detectors' robustness and generalization. Experiments on 11 detectors reveal significant performance drops on real-world data, highlighting the need for more robust detection strategies.

Abstract

The rapid advancement of generative models, such as GANs and Diffusion models, has enabled the creation of highly realistic synthetic images, raising serious concerns about misinformation, deepfakes, and copyright infringement. Although numerous Artificial Intelligence Generated Image (AIGI) detectors have been proposed, often reporting high accuracy, their effectiveness in real-world scenarios remains questionable. To bridge this gap, we introduce AIGIBench, a comprehensive benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. AIGIBench simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms. Extensive experiments on 11 advanced detectors demonstrate that, despite their high reported accuracy in controlled settings, these detectors suffer significant performance drops on real-world data, gain only limited benefit from common augmentations, and are affected in nuanced ways by test-time pre-processing, highlighting the need for more robust detection strategies. By providing a unified and realistic evaluation framework, AIGIBench offers valuable insights to guide future research toward dependable and generalizable AIGI detection.
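
To make the "robustness to image degradation" task concrete, the following is a minimal sketch of the kinds of degradations such a task typically applies (JPEG re-compression, Gaussian blur, additive noise), written with Pillow and NumPy; the operator choices and parameter values are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative test-time degradations of the kind robustness benchmarks
# apply to images before detection. Parameter values are assumptions,
# not AIGIBench's actual settings.
import io

import numpy as np
from PIL import Image, ImageFilter


def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Re-encode the image as JPEG at the given quality, then decode it."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def gaussian_blur(img: Image.Image, radius: float = 1.5) -> Image.Image:
    """Apply a Gaussian blur with the given radius."""
    return img.filter(ImageFilter.GaussianBlur(radius))


def add_gaussian_noise(img: Image.Image, sigma: float = 10.0) -> Image.Image:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    arr = np.asarray(img, dtype=np.float32)
    noisy = arr + np.random.normal(0.0, sigma, size=arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
```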


Key findings
Despite high accuracy in controlled settings, detectors suffer significant performance drops on real-world data. Common augmentations provide only limited benefits. Test-time pre-processing (cropping vs. resizing) has nuanced effects, primarily improving accuracy on real images but not necessarily on fake images.
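
As a rough illustration of the cropping-vs-resizing comparison, here is a minimal sketch of the two pre-processing pipelines using torchvision; the 224-pixel input resolution and pipeline details are assumptions, not taken from the paper.

```python
# Two hypothetical test-time pre-processing pipelines: center-cropping
# keeps native pixel statistics, resizing resamples the whole image.
# The input resolution is an assumption, not the paper's setting.
from PIL import Image
from torchvision import transforms

INPUT_SIZE = 224  # assumed detector input resolution

crop_pipeline = transforms.Compose([
    transforms.CenterCrop(INPUT_SIZE),   # preserves local pixel statistics
    transforms.ToTensor(),
])

resize_pipeline = transforms.Compose([
    transforms.Resize((INPUT_SIZE, INPUT_SIZE)),  # resampling can smooth high-frequency traces
    transforms.ToTensor(),
])

img = Image.open("sample.png").convert("RGB")
x_crop = crop_pipeline(img)      # tensor of shape (3, 224, 224)
x_resize = resize_pipeline(img)  # tensor of shape (3, 224, 224)
```

One plausible reading of the asymmetry reported above is that resizing resamples the image and can attenuate the high-frequency generation traces many detectors rely on, while cropping leaves those statistics intact.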
Approach
AIGIBench simulates real-world challenges by evaluating detectors across four tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It uses 23 diverse fake image subsets and real-world samples from social media and AI art platforms.
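
Below is a minimal sketch of how such a per-subset evaluation could be organized, assuming a generic binary real/fake detector interface; the `predict` signature and subset handling are illustrative, not AIGIBench's actual API.

```python
# Hypothetical multi-subset evaluation loop: the detector is scored
# separately on each fake-image subset so generalization gaps show up
# per source. The interface below is an assumption for illustration.
from typing import Callable, Dict, Iterable, Tuple


def evaluate_detector(
    predict: Callable[[object], int],                  # 1 = fake, 0 = real
    subsets: Dict[str, Iterable[Tuple[object, int]]],  # subset name -> (image, label) pairs
) -> Dict[str, float]:
    """Return per-subset accuracy for a binary real/fake detector."""
    results: Dict[str, float] = {}
    for name, samples in subsets.items():
        correct = total = 0
        for image, label in samples:
            correct += int(predict(image) == label)
            total += 1
        results[name] = correct / max(total, 1)
    return results
```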
Datasets
AIGIBench dataset: 23 diverse fake image subsets (GANs, diffusion models, deepfakes, personalized generation); real-world samples from social media (X, Facebook, Reddit) and AI art platforms (ArtStation, Civitai, Liblib); FFHQ, CelebA-HQ, and Open Images V7 for real images; ProGAN and SD-v1.4 generated images for training.
Model(s)
ResNet-50, CNNDetection, Gram-Net, LGrad, CLIPDetection, FreqNet, NPR, LaDeDa, DFFreq, AIDE, SAFE
Author countries
China, Italy