A Novel Unified Approach to Deepfake Detection

Authors: Lord Sen, Shyamapada Mukherjee

Published: 2026-01-06 19:30:53+00:00

AI Summary

This paper introduces a novel architecture for Deepfake detection in images and videos. The proposed method uses cross-attention between spatial and frequency domain features, augmented with a blood detection module, to classify content as real or fake. It achieves state-of-the-art results, including 99.80% and 99.88% AUC on the FF++ and Celeb-DF datasets respectively, and demonstrates strong cross-dataset generalization.

Abstract

Advancements in the field of AI are increasingly giving rise to various threats, one of the most prominent being the synthesis and misuse of Deepfakes. To sustain trust in this digital age, the detection and tagging of deepfakes is essential. In this paper, a novel architecture for Deepfake detection in images and videos is presented. The architecture uses cross-attention between spatial and frequency domain features, along with a blood detection module, to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than SOTA: specifically, 99.80% and 99.88% AUC on FF++ and Celeb-DF when using Swin Transformer and BERT, and 99.55% and 99.38% when using EfficientNet-B4 and BERT. The approach also generalizes very well, achieving strong cross-dataset results.


Key findings
The proposed architecture achieved state-of-the-art results, with AUCs of 99.80% on FF++ and 99.88% on Celeb-DF using Swin Transformer and BERT. It also demonstrated strong cross-dataset generalization capabilities, achieving 94.01% AUC on Celeb-DF when trained on FF++, indicating robustness against diverse deepfake generation techniques.
Approach
The approach processes input images and videos through dual streams: one extracts spatial features using a CNN (like EfficientNet-B4) or Vision Transformer (like Swin Transformer), and another extracts frequency domain features using a transformer (like BERT). These features are fused via a cross-stream attention mechanism. A parallel blood detection module contributes to the final classification, which is refined through multi-scale patch embedding and a class token refinement module.
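The cross-stream attention described above can be sketched as standard scaled dot-product attention in which tokens from one stream query tokens from the other. The following is a minimal NumPy illustration, not the authors' implementation: the token counts, feature dimension, shared projection weights, and mean-pool-then-concatenate fusion are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Each token in `query_feats` attends to all tokens in `context_feats`."""
    q = query_feats @ w_q                      # (n_q, d)
    k = context_feats @ w_k                    # (n_c, d)
    v = context_feats @ w_v                    # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_c)
    return softmax(scores, axis=-1) @ v        # (n_q, d)

# Hypothetical token counts and feature dimension for illustration.
d = 64
spatial = rng.standard_normal((49, d))   # e.g. 7x7 grid of spatial tokens
freq = rng.standard_normal((16, d))      # frequency-domain tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Cross-stream fusion: each stream attends to the other, then pool and concat.
spatial_attn = cross_attention(spatial, freq, w_q, w_k, w_v)
freq_attn = cross_attention(freq, spatial, w_q, w_k, w_v)
fused = np.concatenate([spatial_attn.mean(axis=0), freq_attn.mean(axis=0)])
print(fused.shape)  # (128,)
```

In the paper's full pipeline this fused representation would additionally pass through multi-scale patch embedding and class token refinement, and be combined with the blood detection module's output, before the final real/fake classification.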
Datasets
FaceForensics++ (FF++), Celeb-DF (CDF), WildDeepfake (WDF), DeepFakeDetection (DFD), DeepFake Detection Challenge (DFDC)
Model(s)
Swin Transformer, BERT, EfficientNet-B4, DistilBERT (as alternative to BERT)
Author countries
India