KoDF: A Large-scale Korean DeepFake Detection Dataset

Authors: Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sungwoo Park, Gyeongsu Chae

Published: 2021-03-18 09:04:02+00:00

AI Summary

This paper introduces KoDF, a large-scale Korean deepfake detection dataset designed to address the underrepresentation of Asian subjects in existing datasets. KoDF contains a large number of real videos alongside synthesized videos generated with six deepfake methods, and it includes clips with adversarial attacks to support training more robust detection models.

Abstract

A variety of effective face-swap and face-reenactment methods have been publicized in recent years, democratizing the face synthesis technology to a great extent. Videos generated as such have come to be called deepfakes with a negative connotation, for various social problems they have caused. Facing the emerging threat of deepfakes, we have built the Korean DeepFake Detection Dataset (KoDF), a large-scale collection of synthesized and real videos focused on Korean subjects. In this paper, we provide a detailed description of methods used to construct the dataset, experimentally show the discrepancy between the distributions of KoDF and existing deepfake detection datasets, and underline the importance of using multiple datasets for real-world generalization. KoDF is publicly available at https://moneybrain-research.github.io/kodf in its entirety (i.e. real clips, synthesized clips, clips with adversarial attack, and metadata).


Key findings
The study demonstrates that no single existing deepfake detection dataset sufficiently represents the real-world distribution of deepfakes. Using multiple datasets, including KoDF, significantly improves the generalization performance of deepfake detection models, highlighting the dataset's value in complementing existing resources.
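One common way to probe this kind of generalization gap is a cross-dataset evaluation matrix: score each test set with a detector trained on each training set and compare the AUC values on and off the diagonal. The sketch below is only an illustration of that idea, not the paper's evaluation code. The names `roc_auc` and `cross_dataset_auc`, the toy detectors, and the toy data are all hypothetical, and the numbers are made up rather than taken from KoDF or any other dataset.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUC: the fraction of fake/real pairs in which
    the fake clip gets the higher score; ties count as half a win."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("both classes must be present")
    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


# Hypothetical container: (per-clip features, labels), with 1 = fake, 0 = real.
TestSet = Tuple[List[float], List[int]]


def cross_dataset_auc(
    detectors: Dict[str, Callable[[float], float]],
    test_sets: Dict[str, TestSet],
) -> Dict[str, Dict[str, float]]:
    """Return matrix[train_name][test_name] = AUC of the detector trained on
    train_name when it scores the clips of test_name."""
    matrix: Dict[str, Dict[str, float]] = {}
    for train_name, detector in detectors.items():
        matrix[train_name] = {}
        for test_name, (samples, labels) in test_sets.items():
            scores = [detector(x) for x in samples]
            matrix[train_name][test_name] = roc_auc(labels, scores)
    return matrix


# Toy, made-up data: each "clip" is a single number, and the two toy datasets
# have opposite feature/label relationships to mimic a distribution mismatch.
toy_sets = {
    "toy_a": ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),
    "toy_b": ([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1]),
}
# Stand-ins for detectors "trained" on each toy dataset (assumption).
toy_detectors = {
    "toy_a": lambda x: x,
    "toy_b": lambda x: 1.0 - x,
}

matrix = cross_dataset_auc(toy_detectors, toy_sets)
for train_name, row in matrix.items():
    print(train_name, row)
```

In this toy setup each stand-in detector separates its own dataset perfectly and fails on the other one. That is an exaggerated version of the in-distribution versus out-of-distribution gap that motivates training on several datasets together.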
Approach
The authors built KoDF by recording videos of Korean subjects and generating deepfakes from them with six different synthesis models. They applied a quality control process to the synthesized clips and added adversarial examples so that the dataset is more useful for training robust deepfake detection models.
Datasets
KoDF (Korean DeepFake Detection Dataset), FF++, DFDC, GDFD, DF-1.0, UADFV, DeepfakeTIMIT, Celeb-DF
Model(s)
Detection: the paper does not propose a detection model of its own; for evaluation it uses the winning model from the DeepFake Detection Challenge (DFDC) competition. Generation: FaceSwap, DeepFaceLab, FSGAN, First Order Motion Model (FOMM), Audio-driven Talking Face HeadPose (ATFHP), and Wav2Lip.
Author countries
Republic of Korea