PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset

Authors: Yang Hou, Haitao Fu, Chuankai Chen, Zida Li, Haoyu Zhang, Jianjun Zhao

Published: 2024-05-14 06:40:05+00:00

AI Summary

The paper introduces PolyGlotFake, a novel multilingual and multimodal deepfake dataset containing videos in seven languages, generated using various cutting-edge techniques. This dataset addresses the limitations of existing datasets by offering a more realistic and diverse representation of current deepfake technology, enabling advancements in multimodal deepfake detection.

Abstract

With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Deepfake detection has consequently emerged as a crucial strategy for countering these growing threats. However, as a key factor in training and validating deepfake detectors, most existing deepfake datasets focus primarily on the visual modality, while the few multimodal ones rely on outdated techniques and limit their audio content to a single language, failing to represent the cutting-edge advancements and globalization trends in current deepfake technology. To address this gap, we propose a novel multilingual and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular text-to-speech, voice-cloning, and lip-sync technologies. We conduct comprehensive experiments with state-of-the-art detection methods on the PolyGlotFake dataset. These experiments demonstrate the dataset's significant challenges and its practical value in advancing research into multimodal deepfake detection.


Key findings
Experiments show that state-of-the-art deepfake detectors trained on existing datasets perform significantly worse on PolyGlotFake, highlighting the dataset's difficulty and its value in advancing multimodal deepfake detection research. Quantitative assessment also shows the dataset's superior visual and audio quality relative to prior datasets.
Approach
PolyGlotFake is created by collecting real videos in seven languages, translating their content, and then using various state-of-the-art TTS, voice cloning, and lip-sync technologies to generate corresponding fake videos in those languages. The dataset includes detailed annotations on the techniques used for both audio and visual manipulations.
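The generation pipeline described above (collect a real video, translate its transcript, synthesize speech in the target language, then lip-sync the video to the new audio) can be sketched as follows. This is an illustrative stand-in, not the authors' code: the function bodies are hypothetical placeholders for the real models named in the paper (e.g. XTTS or Vall-E-X for speech synthesis, Wav2Lip or VideoRetalking for lip-sync), and the `FakeVideo` record mirrors the per-video method annotations the dataset provides.

```python
from dataclasses import dataclass


@dataclass
class FakeVideo:
    """One dataset entry, annotated with the manipulation methods used."""
    source_id: str
    target_lang: str
    audio_method: str   # e.g. "XTTS", "Vall-E-X" (audio manipulation annotation)
    visual_method: str  # e.g. "Wav2Lip", "VideoRetalking" (visual annotation)
    audio: bytes        # synthesized speech track


def translate_transcript(text: str, target_lang: str) -> str:
    # Placeholder for machine translation of the real video's transcript.
    return f"[{target_lang}] {text}"


def synthesize_speech(text: str, method: str) -> bytes:
    # Placeholder for TTS / voice cloning; the real pipeline invokes the
    # named model to clone the speaker's voice in the target language.
    return f"{method}|{text}".encode()


def generate_fake(source_id: str, transcript: str, target_lang: str,
                  audio_method: str, visual_method: str) -> FakeVideo:
    # 1. Translate the source transcript into the target language.
    translated = translate_transcript(transcript, target_lang)
    # 2. Synthesize cloned speech from the translated transcript.
    audio = synthesize_speech(translated, audio_method)
    # 3. (Placeholder) lip-sync the source video to the new audio track,
    #    and record which techniques produced each modality.
    return FakeVideo(source_id, target_lang, audio_method, visual_method, audio)
```

Recording the audio and visual methods per video is what lets the dataset support fine-grained evaluation, e.g. measuring how a detector degrades on a specific TTS or lip-sync technique.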
Datasets
PolyGlotFake (created by the authors), FakeAVCeleb, DFDC, FF++, Celeb-DF, DeeperForensics, KoDF, DF-Platter, UADFV, DF-TIMIT
Model(s)
MesoNet, MesoInception, Xception, EfficientNet-B4, Capsule, FFD, CORE, RECCE, DSP-FWA, F3Net, SRM, XRes (ensemble of Xception and ResNet), XTTS, Bark, FreeVC, Vall-E-X, Microsoft TTS, Tacotron, Wav2Lip, VideoRetalking, Whisper
Author countries
Japan