A Timely Survey on Vision Transformer for Deepfake Detection

Authors: Zhikan Wang, Zhongyao Cheng, Jiajie Xiong, Xun Xu, Tianrui Li, Bharadwaj Veeravalli, Xulei Yang

Published: 2024-05-14 09:33:04+00:00

AI Summary

This survey paper provides a comprehensive overview of Vision Transformer (ViT)-based deepfake detection models, categorizing them into standalone, sequential, and parallel architectures. It analyzes existing research and highlights future directions in this rapidly evolving field.

Abstract

In recent years, the rapid advancement of deepfake technology has revolutionized content creation, lowering forgery costs while elevating quality. However, this progress brings forth pressing concerns such as infringements on individual rights, national security threats, and risks to public safety. To counter these challenges, various detection methodologies have emerged, with Vision Transformer (ViT)-based approaches showcasing superior performance in generality and efficiency. This survey presents a timely overview of ViT-based deepfake detection models, categorized into standalone, sequential, and parallel architectures. Furthermore, it succinctly delineates the structure and characteristics of each model. By analyzing existing research and addressing future directions, this survey aims to equip researchers with a nuanced understanding of ViT's pivotal role in deepfake detection, serving as a valuable reference for both academic and practical pursuits in this domain.


Key findings
The survey finds that ViT-based approaches deliver superior generality and efficiency in deepfake detection, but challenges remain in model drift, data scarcity, temporal consistency, and bias. Future research should focus on improving model explainability and generalization and on developing multi-modal approaches.
Approach
The survey categorizes existing ViT-based deepfake detection models by how the ViT is integrated: standalone architectures use a ViT as the sole backbone, sequential architectures chain a convolutional feature extractor with a ViT stage, and parallel architectures run convolutional and transformer branches side by side and fuse their outputs. It then analyzes the strengths and limitations of each category and outlines future research directions.
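The three categories can be sketched as composition patterns. The stub feature extractors and threshold below are illustrative assumptions, not components from the surveyed models; a real detector would use an actual CNN backbone and transformer encoder.

```python
# Minimal sketch of the three ViT-based detector compositions the survey
# describes. All functions here are toy stand-ins (assumed names), chosen
# only to make the standalone / sequential / parallel wiring concrete.

def vit_features(pixels):
    """Stand-in for a Vision Transformer encoder: one global feature."""
    return [sum(pixels) / len(pixels)]

def cnn_features(pixels):
    """Stand-in for a convolutional backbone: one local-contrast feature."""
    return [max(pixels) - min(pixels)]

def classify(features):
    """Toy binary head: label 'fake' if the pooled score exceeds 0.5."""
    return "fake" if sum(features) / len(features) > 0.5 else "real"

def standalone(pixels):
    # Standalone: the ViT alone produces the representation.
    return classify(vit_features(pixels))

def sequential(pixels):
    # Sequential: CNN features feed the ViT stage in a pipeline.
    return classify(vit_features(cnn_features(pixels)))

def parallel(pixels):
    # Parallel: CNN and ViT branches run side by side and are fused.
    return classify(cnn_features(pixels) + vit_features(pixels))
```

The point of the sketch is only the data flow: standalone uses one encoder, sequential composes the two, and parallel concatenates their outputs before classification.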
Datasets
FaceForensics++, DeeperForensics-1.0, Celeb-DF-v1, Celeb-DF-v2, DFD, DFDC, TIMIT, RFF, DRFFD, MS-Celeb-1M, SR-DF, UADFV
Model(s)
Identity Consistency Transformer (ICT), Unsupervised Inconsistency-Aware method based on Vision Transformer (UIA-ViT), ViT-Distillation, Shallow ViT, Convolutional Vision Transformer (CVIT), Khan et al.'s model, Wang et al.'s model, MARLIN, Interpretable Spatial-Temporal Video Transformer (ISTVT), Multi-modal Multi-scale Transformer (M2TR), Xue et al.'s model, Generative Convolutional Vision Transformer (GenConViT), David et al.'s model, Zhao et al.'s model, EfficientNet
Author countries
Singapore, China