Towards Benchmarking and Evaluating Deepfake Detection

Authors: Chenhao Lin, Jingyi Deng, Pengbin Hu, Chao Shen, Qian Wang, Qi Li

Published: 2022-03-04 03:12:15+00:00

AI Summary

This research paper establishes a comprehensive benchmark for deepfake detection by creating a challenging dataset with diverse manipulation methods and implementing a repeatable evaluation procedure using multiple metrics. The benchmark allows for sound comparison of existing deepfake detection approaches and measures their progress.

Abstract

Deepfake detection automatically recognizes manipulated media by analyzing the differences between manipulated and unaltered videos. It is natural to ask which existing deepfake detection approaches perform best, both to identify promising research directions and to provide practical guidance. Unfortunately, it is difficult to conduct a sound benchmarking comparison of existing detection approaches using the results in the literature because evaluation conditions are inconsistent across studies. Our objective is to establish a comprehensive and consistent benchmark, to develop a repeatable evaluation procedure, and to measure the performance of a range of detection approaches so that the results can be compared soundly. A challenging dataset consisting of manipulated samples generated by more than 13 different methods has been collected, and 11 popular detection approaches (9 algorithms) from the existing literature have been implemented and evaluated with 6 fair-minded and practical evaluation metrics. In total, 92 models have been trained and 644 experiments have been performed for the evaluation. The results, along with the shared data and evaluation methodology, constitute a benchmark for comparing deepfake detection approaches and measuring progress.

Key findings

The performance of all 11 deepfake detection approaches dropped significantly on the realistic ID test set, highlighting the gap between current methods and real-world needs. No single method was superior across all evaluation metrics. Inconsistent datasets in previous studies led to unfair comparisons between methods.

Approach

The authors created a benchmark dataset comprising manipulated samples from over 13 different methods and evaluated 11 popular detection approaches (9 algorithms) using 6 metrics. They re-implemented the algorithms, trained them on the same data, and performed 644 experiments for fair comparison.
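
As a rough illustration of what scoring every detector on the same data with the same metrics can look like, here is a minimal sketch in plain Python. It is not the authors' evaluation code. The detector names, the toy labels and scores, and the helper `evaluate_all` are hypothetical, and the two metrics shown (accuracy at a fixed threshold and ROC AUC) are common choices picked for the example, since this summary does not list the paper's six metrics.

```python
from typing import Dict, Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples classified correctly at a fixed score threshold."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC: probability that a random fake outscores a random real sample.

    Ties count as half a win. Requires at least one sample of each class.
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((f > r) + 0.5 * (f == r) for f in fake for r in real)
    return wins / (len(fake) * len(real))


def evaluate_all(labels: Sequence[int],
                 scores_by_detector: Dict[str, Sequence[float]]
                 ) -> Dict[str, Dict[str, float]]:
    """Apply the same metrics to every detector on the same labeled test set."""
    return {
        name: {"acc": accuracy(labels, scores), "auc": auc(labels, scores)}
        for name, scores in scores_by_detector.items()
    }


# Toy test set (assumption for illustration): 1 = manipulated, 0 = real.
labels = [1, 1, 1, 0, 0, 0]

# Hypothetical per-sample "fake" scores from two made-up detectors.
scores_by_detector = {
    "detector_a": [0.9, 0.8, 0.4, 0.3, 0.2, 0.1],
    "detector_b": [0.9, 0.6, 0.2, 0.7, 0.3, 0.1],
}

results = evaluate_all(labels, scores_by_detector)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because both detectors are scored against the same labels with the same functions, their numbers can be compared directly. Comparisons drawn from separate papers lack this property when the test data or metric definitions differ.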

Datasets

UADFV, DeepFake-TIMIT, Celeb-DF, DeeperForensics-1.0, FaceForensics++, DFDC, ForgeryNet, and a private dataset; an Imperceptible and Diverse (ID) test set was created from these datasets for evaluating robustness.

Model(s)

Headpose, FWA-Resnet50, Face X-ray, Xception, Mesonet-4, MesoInception-4, Patch-Resnet-Layer1, Patch-Xception-Block2, FFD, Multiple-attention, Conv LSTM

Author countries

China, China, China, China, China, China
