FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Authors: Hasam Khalid, Shahroz Tariq, Minha Kim, Simon S. Woo

Published: 2021-08-11 07:49:36+00:00

Comment: Part of Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021)

AI Summary

This paper introduces FakeAVCeleb, a novel audio-video multimodal deepfake dataset featuring synthesized lip-synced fake audios alongside deepfake videos of ethnically diverse celebrities. The dataset was generated using popular deepfake creation methods to address racial bias and the lack of high-quality multimodal deepfake data. Experiments with state-of-the-art detectors demonstrate the dataset's challenging nature and its utility for developing robust multimodal deepfake detection methods.

Abstract

While significant advancements have been made in the generation of deepfakes using deep learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe security and privacy problems, as they can be used to impersonate a person's identity in a video by replacing his/her face with another person's face. Recently, a new problem has emerged: AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audios and videos, a new generation of deepfake detectors is needed that focuses on both video and audio collectively. To develop a competent deepfake detector, a large amount of high-quality data is typically required to capture real-world (or practical) scenarios. Existing deepfake datasets contain either deepfake videos or audios, and are racially biased as well. As a result, it is critical to develop a high-quality video-and-audio deepfake dataset that can be used to detect both audio and video deepfakes simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset, FakeAVCeleb, which contains not only deepfake videos but also the corresponding synthesized lip-synced fake audios. We generate this dataset using the most popular deepfake generation methods. We selected real YouTube videos of celebrities from four ethnic backgrounds to develop a more realistic multimodal dataset that addresses racial bias and further helps develop multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.


Key findings
The FakeAVCeleb dataset proved to be highly challenging for existing state-of-the-art unimodal, ensemble-based, and multimodal deepfake detection methods, with an average AUC score of around 65% for video-only detection methods. The poor detection performance, even by SOTA models, indicates the realistic quality of the generated fake audios and videos. This highlights the critical need for further research and development of more effective and advanced multimodal deepfake detection methods.
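For context on the ~65% figure, AUC measures how well a detector's scores rank fake clips above real ones; 1.0 is perfect separation and 0.5 is chance. A minimal, library-free sketch of the metric (via the rank-sum formulation; the function name and label convention are illustrative, not taken from the paper's code):

```python
def auc_score(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: 1 for fake, 0 for real; scores: higher means "more likely fake".
    """
    pairs = sorted(zip(scores, labels))
    # Assign average 1-based ranks so tied scores are handled correctly.
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# A chance-level detector yields ~0.5; the ~0.65 reported above
# is only modestly better than chance.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```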
Approach
The authors created FakeAVCeleb by selecting 500 real celebrity videos from the VoxCeleb2 dataset, ensuring diversity across ethnic background, gender, and age. They generated four audio-video combinations (ARVR, AFVR, ARVF, and AFVF, where A/V denote audio/video and R/F denote real/fake) using face swapping (Faceswap, FSGAN), facial reenactment (Wav2Lip), and real-time voice cloning (SV2TTS). A facial-recognition service (Face++) was used for similarity matching, and generated samples underwent manual inspection to ensure realistic quality and accurate lip-sync.
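The four combinations above can be made concrete as per-modality labels; a clip counts as a deepfake if either modality is synthesized. A small sketch (the table keys and helper name follow the A{R,F}V{R,F} naming above and are illustrative, not the dataset's actual code):

```python
# Hypothetical mapping of the four audio-video combinations to
# per-modality real/fake flags (A = audio, V = video; R = real, F = fake).
COMBINATIONS = {
    "ARVR": {"audio_fake": False, "video_fake": False},  # real audio, real video
    "AFVR": {"audio_fake": True,  "video_fake": False},  # cloned audio over real video
    "ARVF": {"audio_fake": False, "video_fake": True},   # real audio, swapped/reenacted video
    "AFVF": {"audio_fake": True,  "video_fake": True},   # both modalities synthesized
}

def is_deepfake(combo: str) -> bool:
    """A clip is a deepfake if either its audio or its video is fake."""
    c = COMBINATIONS[combo]
    return c["audio_fake"] or c["video_fake"]
```

Under this framing, only ARVR is fully genuine; the other three categories are what a multimodal detector must flag, including AFVR, which a video-only model cannot catch at all.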
Datasets
VoxCeleb2, UADFV, DeepfakeTIMIT, FF++, Google DFD, DeeperForensics-1.0, DFDC, Celeb-DF, KoDF, FakeAVCeleb
Model(s)
VGG, Meso4, EfficientNet-B0, Xception
Author countries
South Korea