CAE-Net: Generalized Deepfake Image Detection using Convolution and Attention Mechanisms with Spatial and Frequency Domain Features

Authors: Kafi Anan, Anindya Bhattacharjee, Ashir Intesher, Kaidul Islam, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz

Published: 2025-02-15 06:02:11+00:00

AI Summary

This paper proposes CAE-Net, a weighted ensemble model for generalized deepfake image detection. The authors address class imbalance with a disjoint set-based multistage training method and combine EfficientNet, DeiT, and a ConvNeXt model with a wavelet transform to capture both spatial and frequency features. The ensemble achieves 94.63% accuracy on the validation dataset of the SP Cup 2025 competition.

Abstract

Effective deepfake detection tools are becoming increasingly essential given the growing usage of deepfakes in unethical practices. There exists a wide range of deepfake generation techniques, which makes it challenging to develop an accurate universal detection mechanism. The 2025 IEEE Signal Processing Cup (DFWild-Cup competition) provided a diverse dataset of deepfake images containing significant class imbalance. The images in the dataset are generated by multiple deepfake image generators, so that machine learning models trained on them emphasize the generalization of deepfake detection. To this end, we propose a disjoint set-based multistage training method to address the class imbalance and devise an ensemble-based architecture, CAE-Net. Our architecture consists of a convolution- and attention-based ensemble network, and employs three different neural network architectures: EfficientNet, Data-Efficient Image Transformer (DeiT), and ConvNeXt with wavelet transform to capture both local and global features of deepfakes. We visualize the specific regions that these models focus on for classification using Grad-CAM, and empirically demonstrate the effectiveness of these models in grouping real and fake images into cohesive clusters using t-SNE plots. Individually, the EfficientNet B0 architecture achieves 90.79% accuracy, whereas the ConvNeXt and the DeiT architectures achieve 89.49% and 89.32% accuracy, respectively. With these networks, our weighted ensemble model achieves an accuracy of 94.63% on the validation dataset of the SP Cup 2025 competition. The equal error rate of 4.72% and the Area Under the ROC curve of 97.37% further confirm the stability of our proposed method. Finally, the robustness of our proposed model against adversarial perturbation attacks is tested as well, showing the inherent defensive properties of the ensemble approach.


Key findings
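The weighted ensemble model achieved 94.63% accuracy on the validation set, with an equal error rate of 4.72% and an AU-ROC of 97.37%, compared with 90.79%, 89.49%, and 89.32% accuracy for EfficientNet B0, ConvNeXt, and DeiT individually. Grad-CAM visualizations highlighted the models' focus on different image regions, and t-SNE plots showed the effective separation of real and fake image clusters. The model was also tested against adversarial perturbation attacks, and the abstract reports that the ensemble approach shows inherent defensive properties.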
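
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how an area under the ROC curve and an equal error rate can be computed from binary labels and scores. It is not the authors' evaluation code. The function names (`roc_curve_points`, `auc_score`, `equal_error_rate`) are hypothetical, the label convention (1 = fake, 0 = real) is an assumption, and the EER is approximated at the nearest ROC point rather than interpolated.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """Return (fpr, tpr) swept over all distinct score thresholds.

    Assumes labels are 0/1 and that both classes are present.
    """
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    labels, scores = labels[order], scores[order]
    # Keep one ROC point per distinct score so tied scores move together.
    last = np.r_[np.where(np.diff(scores))[0], len(scores) - 1]
    tps = np.cumsum(labels)[last]       # true positives at each threshold
    fps = np.cumsum(1 - labels)[last]   # false positives at each threshold
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr

def auc_score(labels, scores):
    """Area under the ROC curve via the trapezoidal rule."""
    fpr, tpr = roc_curve_points(labels, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def equal_error_rate(labels, scores):
    """Approximate EER: the ROC point where FPR and FNR are closest."""
    fpr, tpr = roc_curve_points(labels, scores)
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[i] + fnr[i]) / 2.0)
```

A lower EER and a higher AUC both indicate better separation of the two classes across thresholds, which is why they complement a single-threshold accuracy figure.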
Approach
The authors address the class imbalance in the dataset using a multistage training method with disjoint subsets of fake images. They propose CAE-Net, an ensemble model that combines EfficientNet, DeiT, and a wavelet-transformed ConvNeXt, each focusing on different image features. A weighted average of these models' predictions forms the final output.
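
The two ideas in this approach can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, and the number of stages, the random shuffling, the reuse of all real images in every stage, and the example weights are assumptions that the summary does not specify.

```python
import numpy as np

def disjoint_stage_splits(real_ids, fake_ids, n_stages, seed=0):
    """Split fake-image ids into n_stages disjoint subsets.

    Each stage pairs one fake subset with the real images (assumed here
    to be reused in full at every stage), so no fake image is seen in
    more than one stage.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(fake_ids))
    chunks = np.array_split(shuffled, n_stages)
    real_ids = np.asarray(real_ids)
    return [(real_ids, chunk) for chunk in chunks]

def weighted_ensemble(probs, weights):
    """Weighted average of per-model fake probabilities.

    probs:   array of shape (n_models, n_samples)
    weights: one non-negative weight per model (normalized to sum to 1)
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ probs

# Hypothetical usage: 10 real ids, 30 fake ids, 3 training stages.
stages = disjoint_stage_splits(range(10), range(100, 130), n_stages=3)
# Hypothetical per-model probabilities for two images and example weights.
fused = weighted_ensemble([[0.9, 0.2], [0.6, 0.4], [0.3, 0.8]], [2, 1, 1])
```

Splitting the larger class into disjoint subsets keeps each stage closer to balanced without discarding data, while the weighted average lets stronger individual models contribute more to the final score.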
Datasets
A diverse dataset from the 2025 IEEE Signal Processing Cup (DFWild-Cup) competition, combining images from eight publicly available datasets: Celeb-DF-v1, Celeb-DF-v2, FaceForensics++, DeepfakeDetection, FaceShifter, UADFV, Deepfake Detection Challenge Preview, and Deepfake Detection Challenge.
Model(s)
EfficientNet B0, DeiT-B, ConvNeXt-tiny with Haar wavelet transform. A weighted ensemble of these three models is used.
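
As a rough illustration of the frequency-domain input, a single-level 2D Haar wavelet transform can be written in a few lines of NumPy. This is a sketch under stated assumptions: the function name `haar_dwt2` is hypothetical, the input is a single-channel image with even height and width, and how the resulting sub-bands are fed to ConvNeXt-tiny is not specified in this summary.

```python
import numpy as np

def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    Returns four sub-bands of shape (H/2, W/2): an approximation band
    (ll) and three detail bands. Sub-band naming conventions vary
    between libraries; here lh differences rows, hl differences
    columns, and hh differences both.
    """
    x = np.asarray(image, dtype=float)
    if x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

# Hypothetical usage on a random 8x8 grayscale patch.
patch = np.random.default_rng(0).random((8, 8))
ll, lh, hl, hh = haar_dwt2(patch)
```

The detail bands respond to abrupt local intensity changes, which is one reason wavelet features are used to expose high-frequency artifacts that spatial-domain models may miss.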
Author countries
Bangladesh