SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Authors: Vrushank Ahire, Aniruddh Muley, Shivam Zample, Siddharth Verma, Pranav Menon, Surbhi Madan, Abhinav Dhall

Published: 2025-10-06 09:35:57+00:00

Journal Ref: IEEE SPS Signal Processing Cup at ICASSP 2025

AI Summary

This paper introduces SFANet, a novel ensemble framework for deepfake detection that combines transformer-based architectures (Swin Transformers, ViTs) with texture-based methods. The approach incorporates innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation to enhance generalization and focus on high-impact facial regions like eyes and mouth. SFANet achieves state-of-the-art performance on the DFWild-Cup dataset, demonstrating the effectiveness of hybrid models in addressing deepfake detection challenges.

Abstract

Detecting manipulated media has now become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We thus propose a novel ensemble framework, combining the strengths of transformer-based architectures, such as Swin Transformers and ViTs, and texture-based methods, to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, enhance high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. The ensemble benefits from the complementarity of these approaches, with transformers excelling in global feature extraction and texture-based methods providing interpretability. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.


Key findings
The SFANet ensemble model achieved state-of-the-art performance on the DFWild-Cup dataset, reaching an AUC of 0.9822 and an accuracy of 0.9613. The research demonstrates that combining transformer-based architectures with texture-based methods, along with targeted data processing techniques, significantly improves generalization and robustness in deepfake detection. The hybrid model effectively leverages the complementary strengths of different approaches for reliable real-world application.
Approach
The proposed SFANet is an ensemble framework that processes input images (frames from videos). It first uses BiSeNet for face segmentation; if key facial components are detected, the image is processed by parallel SwinAtten and SwinFusion models, whose scores are averaged. If facial components are missing, an SFnet model handles the image. This hybrid approach combines spatial and frequency domain features with attention mechanisms for robust detection.
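The routing logic described above can be illustrated with a minimal sketch. Note that this is an assumption-laden illustration, not the authors' implementation: the model callables, the segmentation function, and the `has_key_facial_components` check are hypothetical stand-ins for BiSeNet, SwinAtten, SwinFusion, and SFnet.

```python
# Hypothetical sketch of SFANet's segmentation-gated ensemble routing.
# All components here are placeholders, not the authors' actual code.

def has_key_facial_components(region_labels, required=("eyes", "mouth")):
    """Placeholder gate: BiSeNet would produce a face-parsing mask;
    here we simply check that the required region labels are present."""
    return all(region in region_labels for region in required)

def sfanet_score(image, segment, swin_atten, swin_fusion, sfnet):
    """Route one input frame through the ensemble and return a score.

    segment     -- stand-in for BiSeNet face parsing
    swin_atten  -- stand-in for the SwinAtten branch (returns a score)
    swin_fusion -- stand-in for the SwinFusion branch (returns a score)
    sfnet       -- stand-in fallback model for faces missing key regions
    """
    regions = segment(image)
    if has_key_facial_components(regions):
        # Key facial components found: run both parallel branches
        # and average their scores, as described above.
        return 0.5 * (swin_atten(image) + swin_fusion(image))
    # Key regions missing: fall back to the SFnet branch.
    return sfnet(image)
```

Keeping the gate separate from the scoring branches mirrors the paper's design, where segmentation decides which sub-models see the frame rather than every model scoring every input.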
Datasets
DFWild-Cup dataset (a subset of Celeb-DF-v1, Celeb-DF-v2, FaceForensics++, DeepfakeDetection, FaceShifter, UADFV, Deepfake Detection Challenge Preview, Deepfake Detection Challenge)
Model(s)
Swin Transformers, Vision Transformers (ViTs), BiSeNet, SFnet, SFPnet, SwinAtten, SwinFusion, Xception, EfficientNet-B7, SimCLR
Author countries
India, Australia