Video Face Manipulation Detection Through Ensemble of CNNs

Authors: Nicolò Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Published: 2020-04-16 14:19:40+00:00

AI Summary

This paper proposes a video face manipulation detection method using an ensemble of Convolutional Neural Networks (CNNs). Different CNN models are created from an EfficientNetB4 base network by incorporating attention layers and siamese training. The ensemble approach achieves promising results on two publicly available datasets.

Abstract

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (e.g., FaceSwap, deepfake). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a seriously harmful impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). The ability to objectively detect whether a face has been manipulated in a video sequence is therefore a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119,000 videos.


Key findings

The ensemble of models outperforms the XceptionNet baseline on both datasets in terms of AUC and LogLoss. The attention mechanism helps highlight informative regions of the face for manipulation detection. Siamese training improves feature representation, leading to better separation of real and fake samples.
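
As a rough illustration of the two metrics named above, the following sketch computes binary log loss and ROC AUC from scratch. The function names, the label convention (1 = fake, 0 = real), and the toy scores are assumptions made for this example; they are not taken from the paper, and a library implementation would normally be used instead.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip to avoid log(0) on over-confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def roc_auc(labels, probs):
    """Probability that a random fake sample scores above a random real one.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one fake and one real sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fake, 0 = real) and predicted fake probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

auc = roc_auc(labels, probs)
loss = log_loss(labels, probs)
```

AUC depends only on how the scores rank fake samples against real ones, while log loss also penalizes confident wrong probabilities, so the two metrics can disagree about which detector is better.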

Approach

The authors address the problem by ensembling multiple CNN models derived from EfficientNetB4. Some variants add an attention mechanism, and each architecture is trained either end-to-end or with a siamese strategy. The final prediction is the average of the individual model outputs.
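
The averaging step can be sketched as follows. This is a minimal example, assuming each model emits one probability per sample that the face is manipulated; the summary does not say whether the average is taken over probabilities or raw scores, and the function name and toy values are hypothetical.

```python
from statistics import fmean


def ensemble_average(model_scores):
    """Average the scores of several models, sample by sample.

    model_scores: list of per-model lists of equal length. Each entry is
    assumed to be a probability in [0, 1] that the face is manipulated.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_samples = len(model_scores[0])
    if any(len(scores) != n_samples for scores in model_scores):
        raise ValueError("all models must score the same samples")
    # zip(*...) regroups the scores so each tuple holds one sample's scores.
    return [fmean(sample) for sample in zip(*model_scores)]


# Hypothetical outputs of four models on three face crops.
scores = [
    [0.90, 0.20, 0.60],
    [0.80, 0.10, 0.40],
    [0.70, 0.30, 0.50],
    [0.80, 0.20, 0.50],
]
ensemble = ensemble_average(scores)
```

An unweighted mean gives every model the same vote, which keeps the ensemble free of extra parameters to tune.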

Datasets

FaceForensics++ (FF++) and the Deepfake Detection Challenge (DFDC) dataset.

Model(s)

EfficientNetB4, EfficientNetB4Att (EfficientNetB4 with an attention mechanism), and their siamese-trained variants (EfficientNetB4ST, EfficientNetB4AttST). XceptionNet serves as the baseline.

Author countries

Italy