DFCon: Attention-Driven Supervised Contrastive Learning for Robust Deepfake Detection

Authors: MD Sadik Hossain Shanto, Mahir Labib Dihan, Souvik Ghosh, Riad Ahmed Anonto, Hafijul Hoque Chowdhury, Abir Muhtasim, Rakib Ahsan, MD Tanvir Hassan, MD Roqunuzzaman Sojib, Sheikh Azizul Hakim, M. Saifur Rahman

Published: 2025-01-28 04:46:50+00:00

AI Summary

This research proposes a robust deepfake detection system using an ensemble of three advanced backbone models (MaxViT, CoAtNet, and EVA-02) trained with supervised contrastive loss. The ensemble leverages the models' complementary strengths in feature extraction and achieves high accuracy on a diverse deepfake dataset.

Abstract

This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), focusing on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, including MaxViT, CoAtNet, and EVA-02, fine-tuned using supervised contrastive loss to enhance feature separation. These models were specifically chosen for their complementary strengths. The integration of convolution layers and strided attention in MaxViT is well-suited for detecting local features, while the hybrid use of convolution and attention mechanisms in CoAtNet effectively captures multi-scale features. EVA-02's robust pretraining with masked image modeling excels at capturing global features. After training, we freeze the parameters of these models and train the classification heads. Finally, a majority voting ensemble is employed to combine the predictions from these models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves a commendable accuracy of 95.83% on the validation dataset.


Key findings
The ensemble model achieved 95.83% accuracy on the validation dataset, significantly outperforming baseline models. Supervised contrastive learning effectively separated real and fake image embeddings. The combination of diverse models and augmentation techniques resulted in robust and generalized deepfake detection.
Approach
The approach uses three pre-trained backbone models (MaxViT, CoAtNet, and EVA-02) fine-tuned with supervised contrastive loss for feature separation. After freezing the backbone parameters, classification heads are trained, and a majority voting ensemble combines their predictions.
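The two core ingredients of this pipeline, supervised contrastive loss on backbone embeddings and majority voting over per-model predictions, can be sketched as follows. This is a minimal NumPy illustration of the standard supervised contrastive (SupCon) formulation and a simple binary majority vote; it is not the authors' implementation, and the function names and the temperature value are assumptions for illustration.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of embeddings (NumPy sketch).

    embeddings: (N, D) array; L2-normalised internally.
    labels:     (N,) integer class labels (e.g. 0 = real, 1 = fake).
    Samples sharing a label are pulled together; all others are pushed apart.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                      # pairwise cosine similarities
    not_self = ~np.eye(len(z), dtype=bool)           # exclude self-similarity
    # log-softmax over all other samples, numerically stabilised
    sim_max = np.max(np.where(not_self, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * not_self
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))
    # positives: same label, excluding self
    pos_mask = (labels[:, None] == labels[None, :]) & not_self
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1) / pos_mask.sum(axis=1)
    return -mean_log_prob_pos.mean()

def majority_vote(predictions):
    """predictions: (n_models, N) array of 0/1 labels -> per-sample majority."""
    votes = np.asarray(predictions)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)
```

With three models, as here, the vote is never tied, so every sample receives a definite real/fake decision. Well-separated class embeddings yield a lower SupCon loss than overlapping ones, which is exactly the feature-separation effect the key findings report.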
Datasets
DFWild-Cup 2025 dataset (including images from DeepfakeBench, Celeb-DF-v1, Celeb-DF-v2, FaceForensics++, DeepfakeDetection, FaceShifter, UADFV, Deepfake Detection Challenge Preview, and Deepfake Detection Challenge) and a secondary dataset of 12,200 additional fake images generated using various deepfake methods.
Model(s)
MaxViT, CoAtNet, EVA-02 (ensembled using majority voting). ResNet50, ResNet152, ResNet101, InceptionV3, and InceptionResNetV2 were used for baseline comparison.
Author countries
Bangladesh