How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study


Authors: Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai Tiangratanakul, Tsang Ng, En Wei, Jiayu Xue

Published: 2026-02-08 04:36:13+00:00

AI Summary

This paper presents the first comprehensive zero-shot evaluation of 16 state-of-the-art AI-generated image detection methods (23 pretrained variants) across 12 diverse datasets, totaling 2.6 million image samples from 291 unique generators. The study finds that no universal detector exists: detector rankings are unstable across datasets, and mean accuracy ranges from 75.0% for the best method down to 37.5% for the worst. It also finds that training data alignment affects generalization more than architectural choices do, and that modern commercial generators defeat most current detectors.

Abstract

As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance, the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6 million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1) no universal winner exists, with detector rankings exhibiting substantial instability (Spearman $\rho$: 0.01–0.87 across dataset pairs); (2) a 37 percentage-point performance gap separates the best detector (75.0% mean accuracy) from the worst (37.5%); (3) training data alignment critically impacts generalization, causing up to 20–60% performance variance within architecturally identical detector families; (4) modern commercial generators (Flux Dev, Firefly v4, Midjourney v7) defeat most detectors, which achieve only 18–30% average accuracy on their images; and (5) we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $\chi^2 = 121.01$, $p < 10^{-16}$, Kendall $W = 0.524$). Our findings challenge the "one-size-fits-all" detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.

Key findings

The study found no universal best detector: rankings were unstable across datasets (Spearman ρ between 0.01 and 0.87 across dataset pairs), and a 37 percentage-point gap separated the best method (75.0% mean accuracy) from the worst (37.5% mean accuracy). Training data alignment was identified as more critical than architectural design for zero-shot generalization, causing up to 20–60% performance variance within architecturally identical detector families. Furthermore, modern commercial generators (e.g., Flux Dev, Firefly v4, Midjourney v7) defeated most detectors, which reached only 18–30% average accuracy on their images, indicating an accelerating arms race.
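To make the ranking-instability finding concrete, the following is a minimal sketch of how a Spearman rank correlation between detector rankings can be computed for every pair of datasets. The accuracy values, dataset names, and function names are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies (NOT from the paper): one list per dataset,
# one entry per detector, with detectors in the same order in every list.
accuracy = {
    "dataset_a": [0.90, 0.80, 0.70, 0.60],
    "dataset_b": [0.85, 0.75, 0.65, 0.55],
    "dataset_c": [0.50, 0.60, 0.70, 0.80],
}

pairwise_rho = {
    (a, b): spearman_rho(accuracy[a], accuracy[b])
    for a, b in combinations(accuracy, 2)
}
for pair, rho in pairwise_rho.items():
    print(pair, round(rho, 2))
```

In this toy table the first two datasets order the detectors identically while the third reverses the order, so the pairwise correlations sit at the two extremes. A spread of values across dataset pairs, such as the 0.01 to 0.87 range reported above, is what signals that a detector's rank on one dataset says little about its rank on another.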

Approach

The researchers conducted a zero-shot benchmark study, evaluating 16 distinct state-of-the-art AI-generated image detection methods (23 pretrained detector variants) on 12 diverse datasets without any fine-tuning or threshold optimization. They analyzed overall detector performance and the stability of detector rankings across datasets, isolated the effect of training data by comparing variants that share an architecture but differ in training data, and identified systematic failure patterns. Statistical support came from the Friedman test and Spearman rank correlation.
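As an illustration of the statistical step, here is a minimal sketch of the Friedman statistic and Kendall's W computed from a table of per-dataset accuracies, treating datasets as blocks and detectors as treatments. It uses the textbook formulas without a tie correction, which is an assumption; the paper's exact procedure is not described here. The table values and function names are hypothetical.

```python
def mean_ranks(scores):
    """Rank scores within one dataset: 1 = lowest; ties share their mean rank."""
    sorted_scores = sorted(scores)
    ranks = []
    for s in scores:
        first = sorted_scores.index(s) + 1  # 1-based position of first occurrence
        count = sorted_scores.count(s)      # number of scores tied with s
        ranks.append(first + (count - 1) / 2)
    return ranks


def friedman_and_kendall_w(table):
    """Return (chi2, W) for table[i][j] = accuracy of detector j on dataset i.

    chi2 = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1)
    W    = chi2 / (n (k - 1))
    where n = number of datasets, k = number of detectors, and R_j is the
    sum of detector j's within-dataset ranks. No tie correction is applied.
    """
    n = len(table)
    k = len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(mean_ranks(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))
    return chi2, w


# Hypothetical accuracies (NOT from the paper): 4 datasets (rows) x 3 detectors (columns).
table = [
    [0.60, 0.70, 0.90],
    [0.55, 0.75, 0.80],
    [0.50, 0.65, 0.85],
    [0.72, 0.68, 0.95],
]

chi2, w = friedman_and_kendall_w(table)
print(f"Friedman chi2 = {chi2:.3f}, Kendall W = {w:.4f}")
```

Kendall's W rescales the Friedman statistic to the range 0 to 1, where 1 means every dataset ranks the detectors identically and 0 means no agreement. For a p-value, SciPy's `scipy.stats.friedmanchisquare` takes one sequence per detector and returns the statistic together with its p-value.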

Datasets

GenImage, AIGCDetectionBench, WildFake, OpenFake, Diffusion1kstep, SynthBuster, Chameleon, AI-GenBench, GPT-4o, Nano-consistent-150k (Nano-Banana), MNW Benchmark, Community Forensics

Model(s)

PatchCraft (EfficientNet-B4), AIDE (ResNet-50 variants), CNNSpot (ResNet-50), SPAI (Custom CNN / Spectral), LOTA (DenseNet-121), Effort (VFM + SVD), ForgeLens (CLIP-ViT + WSGM), DRCT (CLIP-ViT-B/16 variants, ConvNeXt-Base variants), FreDect (Frequency-domain Net), Gram (GramNet), LGrad (LaplacianGrad), Fusing (Multi-frequency fusion), UnivFD (Universal Freq. Detector), Community-Forensics (Multi-model ensemble), SAFE (Adaptive Ensemble/Transformer), Forensic-MoE (ViT-B/16 + LoRA Adapters)

Author countries

United States
