Classifying Deepfakes Using Swin Transformers

Authors: Aprille J. Xi, Eason Chen

Published: 2025-01-26 19:35:46+00:00

AI Summary

This research investigates the use of Swin Transformers for deepfake image detection. On the Real and Fake Face Detection dataset, the Swin Transformer achieved higher test accuracy (71.29%) than conventional CNN architectures such as VGG16, ResNet18, and AlexNet; hybrid models combining it with ResNet and KNN were also evaluated.

Abstract

The proliferation of deepfake technology poses significant challenges to the authenticity and trustworthiness of digital media, necessitating the development of robust detection methods. This study explores the application of Swin Transformers, a state-of-the-art architecture leveraging shifted windows for self-attention, in detecting and classifying deepfake images. Using the Real and Fake Face Detection dataset by Yonsei University's Computational Intelligence Photography Lab, we evaluate the Swin Transformer and hybrid models such as Swin-ResNet and Swin-KNN, focusing on their ability to identify subtle manipulation artifacts. Our results demonstrate that the Swin Transformer outperforms conventional CNN-based architectures, including VGG16, ResNet18, and AlexNet, achieving a test accuracy of 71.29%. Additionally, we present insights into hybrid model design, highlighting the complementary strengths of transformer and CNN-based approaches in deepfake detection. This study underscores the potential of transformer-based architectures for improving accuracy and generalizability in image-based manipulation detection, paving the way for more effective countermeasures against deepfake threats.
Key findings

The Swin Transformer achieved the highest test accuracy (71.29%) among the models evaluated. Hybrid models showed promise, although the Swin-KNN model suffered from overfitting. The study highlights the potential of Swin Transformers and their hybrids for improving deepfake detection accuracy.
Approach

The authors used a pre-trained Swin Transformer, adapting it for binary classification by replacing the classification head. They also explored hybrid models combining the Swin Transformer with ResNet and KNN, employing error level analysis (ELA) preprocessing for all models.
Datasets

Real and Fake Face Detection dataset by Yonsei University's Computational Intelligence Photography Lab
Model(s)

Swin Transformer, Swin-ResNet, Swin-KNN, VGG16, ResNet18, AlexNet
Author countries

United States