Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection

Authors: Yaning Zhang, Qiufu Li, Zitong Yu, Linlin Shen

Published: 2024-12-28 14:00:27+00:00

AI Summary

This paper proposes a Distilled Transformer Network (DTN) for face forgery detection, addressing limitations of existing CNN and transformer-based methods. DTN incorporates a Mixture of Experts module, a Locally-enhanced Vision Transformer, and a Multi-attention Scaling module to improve the detection of both local and global forgery artifacts, surpassing state-of-the-art performance on five datasets.

Abstract

Face forgery detection (FFD) is devoted to detecting the authenticity of face images. Although current CNN-based works achieve outstanding performance in FFD, they tend to capture only the local forgery patterns generated by various manipulation methods. Transformer-based detectors improve the modeling of global dependencies, but they are not good at exploring local forgery artifacts. Hybrid transformer-based networks are designed to capture both local and global manipulated traces, but they tend to suffer from the attention collapse issue as the transformer blocks go deeper. Besides, soft labels are rarely available in FFD. In this paper, we propose a distilled transformer network (DTN) to capture both rich local and global forgery traces and to learn general, common representations across different forged faces. Specifically, we design a mixture of experts (MoE) module to mine various robust forgery embeddings. Moreover, a locally-enhanced vision transformer (LEVT) module is proposed to learn locally-enhanced global representations. We design a lightweight multi-attention scaling (MAS) module to avoid attention collapse, which can be plugged into any transformer-based model with only a slight increase in computational cost. In addition, we propose a deepfake self-distillation (DSD) scheme to provide the model with abundant soft-label information. Extensive experiments show that the proposed method surpasses the state of the art on five deepfake datasets.
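The abstract describes MAS as a lightweight, pluggable remedy for attention collapse. The summary does not give the exact formulation, but the idea of rescaling per-head attention can be sketched as a learnable per-head temperature on the attention logits; the function and parameter names below are illustrative, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_attention_scaling(q, k, v, head_scales):
    """Hypothetical sketch of multi-attention scaling.

    q, k, v: arrays of shape (heads, seq_len, head_dim).
    head_scales: shape (heads,), a learnable per-head scale that
    sharpens or flattens each head's attention map, so deeper blocks
    can keep diverse (non-collapsed) attention distributions.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # scaled dot-product
    logits = logits * head_scales[:, None, None]     # per-head rescaling
    attn = softmax(logits, axis=-1)                  # rows sum to 1
    return attn @ v, attn
```

Because the extra parameters are only one scalar per head, such a module adds negligible compute, consistent with the "slight increase in computational costs" claimed in the abstract.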


Key findings
The proposed DTN consistently outperforms state-of-the-art methods on five deepfake datasets in both within-dataset and cross-dataset evaluations. The model demonstrates robustness to common image corruptions and shows improved generalization to novel forgery techniques. Ablation studies confirm the contribution of each module in the DTN architecture.
Approach
The authors propose a Distilled Transformer Network (DTN) that uses a deepfake self-distillation (DSD) scheme to generate soft labels. The DTN architecture integrates a Mixture of Experts (MoE) module and a Locally-enhanced Vision Transformer (LEVT) equipped with a Multi-attention Scaling (MAS) module to capture diverse and robust forgery features. This design aims to address attention collapse and improve the model's generalizability.
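The self-distillation scheme supplies soft labels that a plain binary cross-entropy loss lacks. The summary does not specify the exact objective, but a common form of such a loss, cross-entropy on the hard label plus a KL term toward the teacher's temperature-softened predictions, can be sketched as follows (the hyper-parameters T and alpha are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dsd_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Hedged sketch of a self-distillation loss, not the paper's exact form.

    Combines cross-entropy on the hard (real/fake) label with KL divergence
    toward the teacher's temperature-softened "soft labels".
    """
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    # Hard-label cross-entropy on the student's predictions.
    ce = -np.log(softmax(s)[hard_label] + 1e-12)
    # KL(teacher || student) at temperature T; T**2 rescales gradients.
    ps_T, pt_T = softmax(s / T), softmax(t / T)
    kl = float(np.sum(pt_T * (np.log(pt_T + 1e-12) - np.log(ps_T + 1e-12))))
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```

In a "self"-distillation setup the teacher is typically an earlier snapshot or averaged copy of the same network, so no external labeled teacher is required.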
Datasets
FaceForensics++, Deepfake Detection Challenge (DFDC), Celeb-DF, DeeperForensics-1.0 (DF-1.0), Deepfake Detection Dataset (DFD)
Model(s)
Distilled Transformer Network (DTN) with VGG backbone, Mixture of Experts (MoE) module, Locally-enhanced Vision Transformer (LEVT) module, and Multi-attention Scaling (MAS) module.
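The MoE module listed above mixes several expert feature extractors to mine diverse forgery embeddings. As a mechanism-only illustration (the real experts are deep networks, and the gating details are not given in this summary), a gated mixture can be sketched with linear experts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x, expert_weights, gate_weight):
    """Illustrative mixture-of-experts forward pass (hypothetical names).

    x: input feature vector, shape (in_dim,).
    expert_weights: list of (out_dim, in_dim) matrices, one per expert.
    gate_weight: (n_experts, in_dim) matrix producing gating logits.
    """
    gate = softmax(gate_weight @ x)                   # (n_experts,) mixing weights
    outs = np.stack([W @ x for W in expert_weights])  # (n_experts, out_dim)
    return gate @ outs                                # convex combination
```

The gate outputs sum to one, so the module returns a convex combination of expert outputs; in DTN the experts would instead be learned forgery-feature branches.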
Author countries
China