SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Authors: Vrushank Ahire, Aniruddh Muley, Shivam Zample, Siddharth Verma, Pranav Menon, Surbhi Madan, Abhinav Dhall

Published: 2025-10-06 09:35:57+00:00

Journal Ref: IEEE SPS Signal Processing Cup at ICASSP 2025

AI Summary

This paper introduces SFANet, a novel ensemble framework for deepfake detection that combines transformer-based architectures (Swin Transformers, ViTs) with texture-based methods. The approach incorporates innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation to enhance generalization and focus on high-impact facial regions like eyes and mouth. SFANet achieves state-of-the-art performance on the DFWild-Cup dataset, demonstrating the effectiveness of hybrid models in addressing deepfake detection challenges.

Abstract

Detecting manipulated media has now become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We thus propose a novel ensemble framework, combining the strengths of transformer-based architectures, such as Swin Transformers and ViTs, and texture-based methods, to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, enhance high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. The ensemble benefits from the complementarity of these approaches, with transformers excelling in global feature extraction and texture-based methods providing interpretability. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.


Key findings
The SFANet ensemble model achieved state-of-the-art performance on the DFWild-Cup dataset, reaching an AUC of 0.9822 and an accuracy of 0.9613. The research demonstrates that combining transformer-based architectures with texture-based methods, along with targeted data processing techniques, significantly improves generalization and robustness in deepfake detection. The hybrid model effectively leverages the complementary strengths of different approaches for reliable real-world application.
Approach
The proposed SFANet is an ensemble framework that processes input images (frames from videos). It first uses BiSeNet for face segmentation; if key facial components are detected, the image is processed by parallel SwinAtten and SwinFusion models, whose scores are averaged. If facial components are missing, an SFnet model handles the image. This hybrid approach combines spatial and frequency domain features with attention mechanisms for robust detection.
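The routing logic described above can be illustrated with a minimal sketch. Note that this is an assumption-laden illustration, not the authors' implementation: the model callables, the segmentation function, and the `has_key_facial_components` check are hypothetical stand-ins for BiSeNet, SwinAtten, SwinFusion, and SFnet.

```python
# Hypothetical sketch of SFANet's segmentation-gated ensemble routing.
# All components here are placeholders, not the authors' actual code.

def has_key_facial_components(region_labels, required=("eyes", "mouth")):
    """Placeholder gate: BiSeNet would produce a face-parsing mask;
    here we simply check that the required region labels are present."""
    return all(region in region_labels for region in required)

def sfanet_score(image, segment, swin_atten, swin_fusion, sfnet):
    """Route one input frame through the ensemble and return a score.

    segment     -- stand-in for BiSeNet face parsing
    swin_atten  -- stand-in for the SwinAtten branch (returns a score)
    swin_fusion -- stand-in for the SwinFusion branch (returns a score)
    sfnet       -- stand-in fallback model for faces missing key regions
    """
    regions = segment(image)
    if has_key_facial_components(regions):
        # Key facial components found: run both parallel branches
        # and average their scores, as described above.
        return 0.5 * (swin_atten(image) + swin_fusion(image))
    # Key regions missing: fall back to the SFnet branch.
    return sfnet(image)
```

Keeping the gate separate from the scoring branches mirrors the paper's design, where segmentation decides which sub-models see the frame rather than every model scoring every input.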
Datasets
DFWild-Cup dataset (a subset of Celeb-DF-v1, Celeb-DF-v2, FaceForensics++, DeepfakeDetection, FaceShifter, UADFV, Deepfake Detection Challenge Preview, Deepfake Detection Challenge)
Model(s)
Swin Transformers, Vision Transformers (ViTs), BiSeNet, SFnet, SFPnet, SwinAtten, SwinFusion, Xception, EfficientNet-B7, SimCLR
Author countries
India, Australia