FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection

Authors: Dat Nguyen, Marcella Astrid, Enjie Ghorbel, Djamila Aouada

Published: 2024-10-29 11:36:49+00:00

AI Summary

FakeFormer is a deepfake detection framework that improves Vision Transformer (ViT) performance by introducing a learning-based local attention mechanism (L2-Att) that focuses on artifact-prone patches. The approach outperforms state-of-the-art methods in generalization and computational efficiency while requiring less training data.

Abstract

Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance as compared to Convolutional Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit a suboptimal performance when dealing with the detection of facial forgeries. Our analysis reveals that, as compared to CNNs, ViTs struggle to model localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle inconsistency-prone information. For that purpose, an explicit attention learning guided by artifact-vulnerable patches and tailored to ViTs is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state-of-the-art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at https://github.com/10Ring/FakeFormer.


Key findings
FakeFormer outperforms state-of-the-art deepfake detection methods in both generalization and computational cost. The L2-Att module significantly improves ViT performance, and the approach reduces the need for large-scale training datasets.
Approach
FakeFormer extends Vision Transformers with a Learning-based Local Attention module (L2-Att). Supervision for L2-Att comes from blending-based data synthesis: the module learns to predict vulnerable patches (the regions most likely to contain blending artifacts), guiding the network's attention to these artifact-prone areas for deepfake detection. The model is trained using only real data and the resulting pseudo-fakes; a minimal sketch of the idea follows below.
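The sketch below is not the authors' implementation; it is a hypothetical illustration, under stated assumptions, of how a vulnerability-guided attention head could be attached to a plain ViT: an auxiliary head predicts a per-patch "vulnerability" score, the scores re-weight patch tokens before classification, and the auxiliary supervision would come from the blending masks of pseudo-fakes. Class and function names (`VulnerabilityGuidedViT`, `patch_vuln_head`, `training_losses`) and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a vulnerability-guided ViT (not the FakeFormer code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VulnerabilityGuidedViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Auxiliary head: one "vulnerability" logit per patch (L2-Att-style idea).
        self.patch_vuln_head = nn.Linear(dim, 1)
        self.cls_head = nn.Linear(dim, 1)  # real vs. (pseudo-)fake logit

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        vuln_logits = self.patch_vuln_head(tokens).squeeze(-1)   # (B, N)
        # Attention-like pooling: artifact-prone patches dominate the decision.
        weights = vuln_logits.softmax(dim=-1).unsqueeze(-1)      # (B, N, 1)
        pooled = (weights * tokens).sum(dim=1)                   # (B, dim)
        return self.cls_head(pooled).squeeze(-1), vuln_logits


def training_losses(cls_logit, vuln_logits, is_fake, vuln_target):
    """is_fake: (B,) 0/1 labels (real image vs. blended pseudo-fake).
    vuln_target: (B, N) per-patch soft labels, assumed to be derived from the
    blending mask used to synthesize the pseudo-fake (all zeros for real images)."""
    cls_loss = F.binary_cross_entropy_with_logits(cls_logit, is_fake)
    att_loss = F.binary_cross_entropy_with_logits(vuln_logits, vuln_target)
    return cls_loss + att_loss
```

Because the per-patch targets are derived from the synthetic blending masks, such a setup would only require real images plus pseudo-fakes at training time, which is consistent with the training regime described above.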
Datasets
FF++, Celeb-DF (CDF1, CDF2), WildDeepfake (DFW), DFD, DFDCP, DFDC
Model(s)
Vision Transformer (ViT), Swin Transformer, FakeFormer (ViT-based), FakeSwin (Swin-based)
Author countries
Luxembourg