AI-Powered Deepfake Detection Using CNN and Vision Transformer Architectures

Authors: Sifatullah Sheikh Urmi, Kirtonia Nuzath Tabassum Arthi, Md Al-Imran

Published: 2026-01-03 20:44:50+00:00

Comment: 6 pages, 6 figures, 3 tables. Conference paper

AI Summary

This paper proposes and evaluates four AI models, including three CNNs (DFCNET, MobileNetV3, ResNet50) and one Vision Transformer (VFDNET), for deepfake detection using large face image datasets. Model robustness and generalization were enhanced through data preprocessing and augmentation strategies. VFDNET achieved the highest accuracy, demonstrating the efficacy of AI-powered approaches for dependable deepfake detection.

Abstract

The increasing use of AI-generated deepfakes creates major challenges for maintaining digital authenticity. Four AI-based models, three CNNs and one Vision Transformer, were evaluated on large face image datasets. Data preprocessing and augmentation techniques improved model performance across different scenarios. VFDNET achieved the highest accuracy, while MobileNetV3 showed efficient performance, demonstrating AI's capability for dependable deepfake detection.


Key findings
The VFDNET model demonstrated superior performance, achieving the highest accuracy of 99.13% and balanced precision, recall, and F1-score (99.00%) for deepfake image detection. MobileNetV3 also performed robustly with 98.00% accuracy, making it the second-best model. In contrast, ResNet50 showed the lowest performance at 84.28% accuracy, while DFCNET achieved a moderate 95.76% accuracy.
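For readers less familiar with the reported metrics, the following is a minimal sketch of how accuracy, precision, recall, and F1-score are conventionally computed for a binary real-versus-fake classifier. The function name `binary_metrics`, the label convention (1 = fake, 0 = real), and the toy labels are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration; `positive` marks the "fake" class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (1 = fake, 0 = real); illustrative only, not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

On a class-balanced test set, precision, recall, and F1-score tend to sit close to accuracy when errors are spread evenly across the two classes, which is one way to read the "balanced" scores reported for VFDNET.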
Approach
The authors evaluate four AI models for classifying real versus fake face images: a custom Deepfake Convolutional Network (DFCNET), MobileNetV3, ResNet50, and a Vision Fake Detection Network (VFDNET) based on Vision Transformers. The models are trained on a large dataset after extensive preprocessing and augmentation: pixel values are normalized, images are resized, and augmentations such as rotation and flipping are applied to improve generalization.
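The preprocessing steps named above (resizing, pixel normalization, flipping, rotation) can be sketched as follows. This is a dependency-free illustration on a 2-D grid of grayscale values, not the authors' pipeline: the helper names, the target size, the 0.5 flip probability, and the restriction to 90-degree rotations are all assumptions, and a real implementation would use an image library and the settings reported in the paper.

```python
import random


def resize_nearest(img, out_h, out_w):
    """Resize a 2-D pixel grid with nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(img, max_value=255.0):
    """Scale pixel values from [0, max_value] to [0, 1]."""
    return [[px / max_value for px in row] for row in img]


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*img[::-1])]


def preprocess(img, size=4, rng=None, train=True):
    """Resize and normalize; apply random flip/rotation only when training.

    `size` and the augmentation choices are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    out = normalize(resize_nearest(img, size, size))
    if train:
        if rng.random() < 0.5:           # assumed flip probability
            out = hflip(out)
        for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
            out = rotate90(out)
    return out


# Tiny 2x2 "image" with 8-bit values; illustrative only.
toy = [[0, 255], [255, 0]]
eval_view = preprocess(toy, size=4, train=False)
train_view = preprocess(toy, size=4, rng=random.Random(0), train=True)
```

Augmentation is applied only when `train=True`, so evaluation images pass through the deterministic resize and normalization steps alone.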
Datasets
The 140K Real and Fake Faces dataset (a Kaggle dataset by Xhlulu), which includes 70,000 authentic faces from the Flickr dataset (NVIDIA) and 70,000 fake face images generated by StyleGAN (from the 1 Million Fake Faces dataset by Bojan).
Model(s)
Deepfake Convolutional Network (DFCNET), Vision Fake Detection Network (VFDNET), MobileNetV3, ResNet50
Author countries
Bangladesh