Tell me Habibi, is it Real or Fake?

View on arXiv ← Back to list

Authors: Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, Abhinav Dhall

Published: 2025-05-28 16:54:36+00:00

AI Summary

This paper introduces ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring code-switching, addressing a critical gap in multilingual deepfake research. The dataset contains 387k videos and over 765 hours of real and fake videos generated using a novel pipeline integrating TTS and lip-sync models.

Abstract

Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce textbf{ArEnAV}, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It textbf{contains 387k videos and over 765 hours of real and fake videos}. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset can be accessed href{https://huggingface.co/datasets/kartik060702/ArEnAV-Full}{here}.

Key findings

State-of-the-art deepfake detection models showed significantly lower performance on ArEnAV compared to monolingual datasets, highlighting the dataset's difficulty. Even human participants achieved only 60% accuracy in identifying deepfakes in ArEnAV, demonstrating the challenge posed by multilingual code-switched content.

Approach

The authors created a new dataset, ArEnAV, by manipulating existing Arabic-English code-switched videos. They used GPT-4.1-mini for transcript manipulation, multiple TTS and lip-sync models for audio-visual generation, and added various perturbations to enhance realism.

Datasets

YouTube videos from VisPer's Arabic Train subset, ArEnAV (created by the authors)

Model(s)

GPT-4.1-mini, XTTS-v2, OpenVoice-v2, Fairseq Arabic TTS, GPT-TTS, Diff2Lip, LatentSync, Whisper-V2, wav2vec2, Yolo-v5, Meso4, MesoInception4, Xception, BA-TFD, BA-TFD+, VideoLLaMA2, VideoLLaMA2.1-AV, XLSR-Mamba

Author countries

UAE, Australia

← Previous