Towards Benchmarking and Evaluating Deepfake Detection

Authors: Chenhao Lin, Jingyi Deng, Pengbin Hu, Chao Shen, Qian Wang, Qi Li

Published: 2022-03-04 03:12:15+00:00

AI Summary

Because prior studies evaluated deepfake detectors under inconsistent conditions, this paper establishes a comprehensive and consistent benchmark for comparing them. It collects a challenging dataset of manipulated samples generated by more than 13 manipulation methods and re-implements 11 popular detection approaches (9 algorithms) from the literature. These approaches are evaluated with 6 fair and practical metrics, with 92 models trained and 644 experiments performed, to provide a sound basis for comparison.

Abstract

Deepfake detection automatically recognizes manipulated media by analyzing the differences between manipulated and unaltered videos. It is natural to ask which of the existing deepfake detection approaches are the top performers, in order to identify promising research directions and provide practical guidance. Unfortunately, it is difficult to conduct a sound benchmarking comparison of existing detection approaches using the results in the literature because evaluation conditions are inconsistent across studies. Our objective is to establish a comprehensive and consistent benchmark, to develop a repeatable evaluation procedure, and to measure the performance of a range of detection approaches so that the results can be compared soundly. A challenging dataset consisting of manipulated samples generated by more than 13 different methods has been collected, and 11 popular detection approaches (9 algorithms) from the existing literature have been implemented and evaluated with 6 fair-minded and practical evaluation metrics. Finally, 92 models have been trained and 644 experiments have been performed for the evaluation. The results, along with the shared data and evaluation methodology, constitute a benchmark for comparing deepfake detection approaches and measuring progress.


Key findings
The evaluation revealed that existing deepfake detection methods experience a significant performance drop on realistic and challenging datasets, failing to meet real-world application requirements. Under strictly uniform evaluation conditions, no single method demonstrated comprehensive superiority across detection ability, generalization, robustness, and practicability. Different methods showed specific advantages, such as Multiple-attention achieving the best AUC for detection performance, while Patch-Xception-Block2 and Patch-Resnet-Layer1 offered a good balance of detection ability, low inference time, and memory consumption.
Approach
The authors established a comprehensive benchmark for deepfake detection by collecting a challenging dataset, including an 'Imperceptible and Diverse (ID) Test Set' designed to simulate real-world scenarios. They re-implemented 11 popular deepfake detection approaches (9 algorithms) from the literature, trained them under uniform conditions, and evaluated them with six fair-minded metrics that cover robustness, practicability, and efficiency in addition to detection ability and generalization.
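As a rough illustration of what scoring several detectors under identical conditions can look like, the sketch below computes AUC and mean per-sample inference time for each detector on one shared test set. It is not the authors' code: the `roc_auc` and `evaluate` helpers, the toy detectors, and the toy samples are all hypothetical, and AUC plus timing stand in for only part of the paper's six metrics.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: chance that a random fake (label 1) outscores a random real (label 0).

    Ties count as half a win. Quadratic in the number of samples, fine for a sketch.
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one fake and one real sample")
    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


def evaluate(
    detectors: Dict[str, Callable[[float], float]],
    samples: Sequence[float],
    labels: Sequence[int],
) -> Dict[str, Tuple[float, float]]:
    """Score every detector on the same samples; return (AUC, mean seconds per sample)."""
    results = {}
    for name, detect in detectors.items():
        start = time.perf_counter()
        scores = [detect(x) for x in samples]
        elapsed = time.perf_counter() - start
        results[name] = (roc_auc(labels, scores), elapsed / len(samples))
    return results


# Hypothetical toy data: each "sample" is a single made-up feature value.
toy_samples = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]  # 0 = real, 1 = fake

# Hypothetical stand-ins for trained detectors (not models from the paper).
toy_detectors = {
    "identity": lambda x: x,
    "inverted": lambda x: 1.0 - x,
    "constant": lambda x: 0.5,
}

toy_results = evaluate(toy_detectors, toy_samples, toy_labels)
for name, (auc, seconds) in toy_results.items():
    print(f"{name}: AUC={auc:.2f}, mean time per sample={seconds:.2e}s")
```

Holding the test samples and labels fixed across all detectors is the point of the design: differences in the reported numbers then reflect the detectors rather than the evaluation setup.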
Datasets
UADFV, DeepFake-TIMIT, Celeb-DF-v2, DeeperForensics-1.0, FaceForensics++, DFDC, ForgeryNet, and a custom Imperceptible and Diverse (ID) Test Set built from hard examples of public datasets and a private dataset.
Model(s)
Headpose, FWA-Resnet50, Face X-ray, Xception, Mesonet-4, MesoInception-4, Patch-Resnet-Layer1, Patch-Xception-Block2, FFD, Multiple-attention, Conv LSTM.
Author countries
China