Combining EfficientNet and Vision Transformers for Video Deepfake Detection

Authors: Davide Coccomini, Nicola Messina, Claudio Gennaro, Fabrizio Falchi

Published: 2021-07-06 13:35:11+00:00

AI Summary

This paper proposes two novel video deepfake detection architectures that combine EfficientNet and Vision Transformers. The models reach performance close to the state of the art on the DeepFake Detection Challenge (DFDC) dataset without using distillation or ensemble methods, and handle videos containing multiple faces with a simple voting scheme.

Abstract

Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images or videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, given that generative methods are becoming extremely accurate at synthesizing realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining results comparable to some very recent methods that use Vision Transformers. Unlike the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state of the art on the DeepFake Detection Challenge (DFDC) dataset.
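The summary does not spell out the voting rule, so the following is a minimal sketch of one plausible reading: per-frame fake probabilities are averaged within each face track, and the video is flagged if any track crosses a threshold. The function name, the aggregation rule, and the threshold are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def video_verdict(face_probs, threshold=0.5):
    """Return True if the video is judged fake.

    face_probs: list of 1-D arrays, one per detected face track,
    each holding the model's per-frame fake probability.
    """
    # Average each face track over time, then let the faces vote:
    # here a video is flagged as fake if any aggregated face score
    # exceeds the threshold (one manipulated face is enough).
    face_scores = [float(np.mean(p)) for p in face_probs]
    return max(face_scores) > threshold

# Example: two face tracks, the second one likely manipulated.
print(video_verdict([np.array([0.1, 0.2, 0.15]),
                     np.array([0.7, 0.8, 0.9])]))  # True
```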


Key findings
The proposed models achieve an AUC of 0.951 and an F1 score of 88.0% on the DFDC dataset, comparable to state-of-the-art methods but without distillation or ensembling. The Convolutional Cross ViT architecture, particularly when paired with an EfficientNet B0 backbone, performs best, highlighting the benefits of multi-scale analysis (a cross-attention sketch follows below).
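As a rough illustration of the multi-scale idea, the PyTorch sketch below shows a CrossViT-style fusion step in which the CLS token of the small-patch branch attends over the patch tokens of the large-patch branch. The dimensions and the residual update are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """CrossViT-style fusion step: the CLS token of the small-patch
    branch attends over the patch tokens of the large-patch branch."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_small, tokens_large):
        # cls_small:    (B, 1, dim) CLS token, small-patch branch
        # tokens_large: (B, N, dim) patch tokens, large-patch branch
        fused, _ = self.attn(cls_small, tokens_large, tokens_large)
        return cls_small + fused  # residual update of the CLS token

# Smoke test with random tensors.
fusion = CrossScaleFusion()
out = fusion(torch.randn(2, 1, 256), torch.randn(2, 49, 256))
print(out.shape)  # torch.Size([2, 1, 256])
```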
Approach
The authors propose two architectures: EfficientViT, which uses EfficientNet B0 for feature extraction and a Transformer encoder for classification, and Convolutional Cross ViT, which processes small and large patches in parallel branches fused via cross-attention for multi-scale analysis. At inference, a voting scheme aggregates predictions across the multiple faces detected in a video (see the backbone-plus-encoder sketch below).
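A minimal sketch of the EfficientViT idea, assuming a torchvision EfficientNet B0 backbone whose final feature map is flattened into tokens for a standard Transformer encoder; the embedding size, depth, and head count are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class EfficientViTSketch(nn.Module):
    """EfficientNet B0 feature map -> token sequence -> Transformer
    encoder -> real/fake logit from the CLS token."""

    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        # For a 224x224 face crop, the B0 feature extractor
        # yields a (B, 1280, 7, 7) feature map.
        self.backbone = efficientnet_b0(weights=None).features
        self.proj = nn.Linear(1280, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)  # single fake/real logit

    def forward(self, x):
        feats = self.backbone(x)                   # (B, 1280, 7, 7)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 49, 1280)
        tokens = self.proj(tokens)                 # (B, 49, dim)
        cls = self.cls.expand(x.size(0), -1, -1)   # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1)   # prepend CLS token
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])             # classify the CLS token

model = EfficientViTSketch()
logit = model(torch.randn(2, 3, 224, 224))
print(logit.shape)  # torch.Size([2, 1])
```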
Datasets
DeepFake Detection Challenge (DFDC) dataset and FaceForensics++ dataset
Model(s)
EfficientNet B0, Vision Transformers (ViT), Convolutional Vision Transformer (from Wodajo et al.), EfficientViT, Convolutional Cross ViT
Author countries
Italy