Deepfake Geography: Detecting AI-Generated Satellite Images

Authors: Mansur Yerzhanuly

Published: 2025-11-21 20:30:10+00:00

Comment: 18 pages, 8 figures

AI Summary

This study comprehensively compares Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images, addressing the growing threat to satellite imagery authenticity from generative models. Using a curated dataset of over 130,000 labeled RGB images, the research demonstrates that ViTs significantly outperform CNNs in accuracy and robustness. The study further enhances model transparency using architecture-specific interpretability methods, revealing distinct detection behaviors and validating model trustworthiness.

Abstract

The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and global semantic structures. We further enhance model transparency using architecture-specific interpretability methods, including Grad-CAM for CNNs and Chefer's attention attribution for ViTs, revealing distinct detection behaviors and validating model trustworthiness. Our results highlight the ViT's superior performance in detecting structural inconsistencies and repetitive textural patterns characteristic of synthetic imagery. Future work will extend this research to multispectral and SAR modalities and integrate frequency-domain analysis to further strengthen detection capabilities and safeguard satellite imagery integrity in high-stakes applications.


Key findings
Vision Transformers (ViT-B/16) consistently outperformed CNNs (ResNet-50) in detecting AI-generated satellite images, achieving a test accuracy of 95.11% compared to 87.02% for CNNs. ViTs demonstrated greater robustness and an ability to model long-range dependencies and global semantic structures, allowing them to detect broader structural inconsistencies. Interpretability analyses showed CNNs focused on localized texture artifacts, while ViTs captured global spatial relationships and repetitive layouts, explaining their superior performance.
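The CNN-side interpretability finding above is the kind of evidence Grad-CAM heatmaps provide. The following is a minimal, illustrative Grad-CAM sketch for a ResNet-50 detector with a two-class head, hooking the final convolutional stage (layer4); the 224x224 input size, the two-class head, and the choice of target class index are assumptions made for illustration, not the authors' exact implementation.

```python
# Hedged sketch: bare-bones Grad-CAM for an ImageNet-pretrained ResNet-50 detector.
# The 2-class head and input size are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # real vs. fake head
model.eval()

activations, gradients = {}, {}

def fwd_hook(_, __, output):
    # Cache the feature maps of the last convolutional stage.
    activations["feat"] = output.detach()

def bwd_hook(_, grad_input, grad_output):
    # Cache the gradient of the target logit w.r.t. those feature maps.
    gradients["feat"] = grad_output[0].detach()

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

def grad_cam(image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Return an [H, W] heatmap in [0, 1] for one normalized 3x224x224 image.

    target_class selects which logit to explain (e.g., the 'fake' class);
    the index depends on how the dataset labels are ordered.
    """
    logits = model(image.unsqueeze(0))
    model.zero_grad()
    logits[0, target_class].backward()
    # Channel weights = global average of the gradients over spatial dims.
    weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False).squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

Overlaying the resulting heatmap on the input image makes it possible to check whether the detector is reacting to localized texture artifacts or to broader structural layout, which is the distinction the key findings describe.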
Approach
The authors compare a ResNet-50 CNN with a ViT-B/16 Vision Transformer on the binary task of classifying satellite imagery as real or AI-generated. Both models are trained on a combined dataset of real and AI-generated images, with dynamic data augmentation applied during training. Post-hoc interpretability methods, Grad-CAM for the CNN and Chefer's attention attribution for the ViT, are then used to analyze each model's decision-making.
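A minimal sketch of this binary real-vs-fake setup is given below, using torchvision's pretrained ResNet-50 and ViT-B/16 backbones. The directory layout (data/train with real/ and fake/ subfolders), the augmentation choices, and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the binary real-vs-fake fine-tuning setup; paths, augmentations,
# and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dynamic augmentation applied on the fly during training (crops, flips, jitter).
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("data/train", transform=train_tf)  # subfolders: fake/, real/
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

def build_model(arch: str) -> nn.Module:
    """Return an ImageNet-pretrained backbone with a 2-class head."""
    if arch == "resnet50":
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        model.fc = nn.Linear(model.fc.in_features, 2)
    elif arch == "vit_b_16":
        model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
        model.heads.head = nn.Linear(model.heads.head.in_features, 2)
    else:
        raise ValueError(arch)
    return model.to(device)

model = build_model("vit_b_16")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # illustrative epoch count
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Swapping build_model("vit_b_16") for "resnet50" reproduces the CNN baseline under the same training loop, which keeps the comparison between the two architectures controlled.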
Datasets
DM-AER dataset, FSI dataset
Model(s)
ResNet-50 (CNN), ViT-B/16 (Vision Transformer), DenseNet-121 (CNN, included for an additional comparison)
Author countries
UNKNOWN