Robust AI-Synthesized Image Detection via Multi-feature Frequency-aware Learning

Authors: Hongfei Cai, Chi Liu, Sheng Shen, Youyang Qu, Peng Gui

Published: 2025-04-02 03:57:12+00:00

AI Summary

This paper proposes a multi-feature fusion framework for AI-synthesized image detection that combines spatial forensic features (noise correlation, image gradients, pretrained vision encoder knowledge) with frequency-aware learning using Frequency-Adaptive Dilated Convolution. This approach achieves significantly higher accuracy in cross-model detection compared to state-of-the-art methods and demonstrates robustness against image noise.

Abstract

The rapid progression of generative AI (GenAI) technologies has heightened concerns regarding the misuse of AI-generated imagery. To address this issue, robust detection methods have emerged as particularly compelling, especially in challenging conditions where the targeted GenAI models are out-of-distribution or the generated images have been subjected to perturbations during transmission. This paper introduces a multi-feature fusion framework designed to enhance spatial forensic feature representations by incorporating three complementary components, namely noise correlation analysis, image gradient information, and pretrained vision encoder knowledge, using a cross-source attention mechanism. Furthermore, to identify spectral abnormality in synthetic images, we propose a frequency-aware architecture that employs the Frequency-Adaptive Dilated Convolution, enabling the joint modeling of spatial and spectral features while maintaining low computational complexity. Our framework exhibits exceptional generalization performance across fourteen diverse GenAI systems, including text-to-image diffusion models, autoregressive approaches, and post-processed deepfake methods. Notably, it achieves significantly higher mean accuracy in cross-model detection tasks when compared to existing state-of-the-art techniques. Additionally, the proposed method demonstrates resilience against various types of real-world image noise perturbations such as compression and blurring. Extensive ablation studies further corroborate the synergistic benefits of fusing multi-model forensic features with frequency-aware learning, underscoring the efficacy of our approach.


Key findings
The proposed method outperforms state-of-the-art techniques in mean cross-model detection accuracy across fourteen GenAI systems. It remains robust under real-world image noise perturbations such as compression and blurring. Ablation studies confirm the synergistic benefits of multi-feature fusion and frequency-aware learning.
Approach
The method fuses three spatial forensic features (noise correlation, image gradients, and pretrained vision encoder knowledge) using a cross-source attention mechanism, as sketched below. It then applies a frequency-aware architecture built on Frequency-Adaptive Dilated Convolution (FADC) to jointly model spatial and spectral features, improving generalization and robustness.
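
A minimal PyTorch sketch of how the cross-source attention fusion over the three spatial feature streams could look. The module name CrossSourceAttentionFusion, the 256-dimensional feature size, and the single multi-head attention layer are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: fuse three per-image feature vectors (noise correlation,
# image gradients, vision-encoder embedding) with cross-source attention.
import torch
import torch.nn as nn

class CrossSourceAttentionFusion(nn.Module):
    """Let each feature source attend to the others, then pool into one
    joint forensic representation (assumed layout, not the paper's exact one)."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, noise_feat, grad_feat, clip_feat):
        # Each input: (B, dim). Stack the three sources as a length-3 token sequence.
        tokens = torch.stack([noise_feat, grad_feat, clip_feat], dim=1)  # (B, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)   # cross-source attention
        fused = self.norm(tokens + fused)              # residual + layer norm
        return self.proj(fused.mean(dim=1))            # pooled joint representation

# Example usage on dummy 256-d features from the three sources.
if __name__ == "__main__":
    fusion = CrossSourceAttentionFusion(dim=256)
    out = fusion(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 256])
```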
Datasets
Real images and fake images from various generative models, including ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, guided diffusion, Latent Diffusion Model (LDM), Glide, DALL-E, and others. ProGAN images were used for training; images from the remaining generators were held out for cross-model testing.
Model(s)
A custom multi-feature fusion and frequency-aware learning framework incorporating ResNet-50, CLIP-ViT, and Frequency-Adaptive Dilated Convolution (FADC).
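
Below is a simplified, hedged sketch of the frequency-adaptive dilation idea behind FADC: estimate a per-sample high-frequency ratio from the feature map's spectrum and blend the outputs of two dilated convolutions accordingly. The class name, the two fixed dilation rates, and the spectral heuristic are assumptions made for illustration; the actual FADC operator is more involved.

```python
# Illustrative approximation of frequency-adaptive dilated convolution:
# high-frequency content favors the small receptive field (dilation 1),
# smoother content favors the larger dilation (dilation 3).
import torch
import torch.nn as nn

class FrequencyAdaptiveDilatedConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_small = nn.Conv2d(in_ch, out_ch, 3, padding=1, dilation=1)
        self.conv_large = nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3)

    def forward(self, x):
        # Per-sample high-frequency ratio from the 2D spectrum (assumed heuristic).
        spec = torch.fft.fft2(x, norm="ortho").abs()
        h, w = spec.shape[-2:]
        low = spec[..., : h // 4, : w // 4].mean(dim=(1, 2, 3))
        total = spec.mean(dim=(1, 2, 3)) + 1e-8
        hf_ratio = (1.0 - low / total).clamp(0.0, 1.0).view(-1, 1, 1, 1)
        # Blend the two dilated branches according to the estimated frequency content.
        return hf_ratio * self.conv_small(x) + (1.0 - hf_ratio) * self.conv_large(x)

# Example usage on a dummy feature map.
if __name__ == "__main__":
    fadc = FrequencyAdaptiveDilatedConv(64, 64)
    y = fadc(torch.randn(2, 64, 32, 32))
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```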
Author countries
China, Australia