The DeepFake Detection Challenge (DFDC) Dataset

Authors: Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, Cristian Canton Ferrer

Published: 2020-06-12 18:15:55+00:00

AI Summary

This paper introduces the DeepFake Detection Challenge (DFDC) dataset, a large-scale dataset of face-swapped videos created using various methods, designed to facilitate the training and evaluation of deepfake detection models. The authors also analyze the top submissions from the accompanying Kaggle competition, demonstrating that models trained on the DFDC dataset can generalize to real-world deepfakes.

Abstract

Deepfakes are a recent off-the-shelf manipulation technique that allows anyone to swap two identities in a single video. In addition to Deepfakes, a variety of GAN-based face swapping methods have also been published with accompanying code. To counter this emerging threat, we have constructed an extremely large face swap video dataset to enable the training of detection models, and organized the accompanying DeepFake Detection Challenge (DFDC) Kaggle competition. Importantly, all recorded subjects agreed to participate in and have their likenesses modified during the construction of the face-swapped dataset. The DFDC dataset is by far the largest currently and publicly available face swap video dataset, with over 100,000 total clips sourced from 3,426 paid actors, produced with several Deepfake, GAN-based, and non-learned methods. In addition to describing the methods used to construct the dataset, we provide a detailed analysis of the top submissions from the Kaggle contest. We show that although Deepfake detection is extremely difficult and still an unsolved problem, a Deepfake detection model trained only on the DFDC can generalize to real in-the-wild Deepfake videos, and such a model can be a valuable tool when analyzing potentially Deepfaked videos. Training, validation and testing corpora can be downloaded from https://ai.facebook.com/datasets/dfdc.


Key findings
Models trained solely on the DFDC dataset generalized to real-world deepfakes. The top-performing models in the DFDC competition employed ensemble methods and efficient architectures such as EfficientNet. The analysis shows that deepfake detection remains difficult and unsolved, and that large, ethically sourced datasets like DFDC are valuable for advancing the field.
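As a rough illustration of the ensembling idea, the sketch below averages per-frame fake probabilities into a video-level score for each model and then averages those scores across models. The function names and the sample probabilities are hypothetical, and plain averaging is an assumption; the top submissions may have combined their models differently.

```python
from statistics import mean


def video_score(frame_probs):
    """Average one model's per-frame fake probabilities into a video-level score."""
    return mean(frame_probs)


def ensemble_score(per_model_frame_probs):
    """Average the video-level scores of several models (simple unweighted ensemble)."""
    return mean(video_score(probs) for probs in per_model_frame_probs)


# Hypothetical per-frame outputs from three detectors for a single video.
preds = [
    [0.9, 0.8, 0.7],  # detector A
    [0.6, 0.6, 0.6],  # detector B
    [1.0, 1.0, 1.0],  # detector C
]
score = ensemble_score(preds)
print(f"ensemble fake probability: {score:.2f}")
```

An unweighted mean is the simplest choice; weighting models by validation performance is a common alternative that this sketch does not cover.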
Approach
The authors created a large-scale face-swapped video dataset (DFDC) using several Deepfake, GAN-based, and non-learned methods, with all recorded subjects consenting to have their likenesses modified. A Kaggle competition was organized using this dataset, and the top submissions were analyzed to evaluate the effectiveness of different deepfake detection approaches.
Datasets
DeepFake Detection Challenge (DFDC) dataset, comprising training, validation, and private test sets. It contains over 100,000 video clips sourced from 3,426 paid actors and produced with several Deepfake, GAN-based, and non-learned methods. The private test set also includes real "in-the-wild" deepfakes.
Model(s)
Various models were used by participants in the DFDC Kaggle competition, including EfficientNet, Xception, ResNet, SlowFast, and 3D CNNs. The paper does not focus on a specific model but rather analyzes the performance of different approaches.
Author countries
USA