AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

Authors: Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov

Published: 2023-11-26 14:17:51+00:00

AI Summary

The paper introduces AV-Deepfake1M, a large-scale dataset of over 1 million audio-visual deepfake videos spanning three manipulation strategies (audio-only, video-only, and audio-visual). The dataset poses a significant challenge to state-of-the-art deepfake detection and localization methods, underscoring the need for improved techniques.

Abstract

The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M.


Key findings

State-of-the-art deepfake detection and localization methods show significantly reduced performance on AV-Deepfake1M compared to previous datasets. Human subjects also struggled to detect the deepfakes, underscoring the realism and difficulty of the dataset. In addition, pre-training models on AV-Deepfake1M and then fine-tuning them on LAV-DF significantly improved performance on LAV-DF (a minimal sketch of this protocol follows).
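
The transfer result above amounts to a standard pre-train/fine-tune protocol. Below is a minimal PyTorch sketch of it; the detector architecture, data loaders, loss, learning rates, and epoch counts are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: pre-train a binary deepfake detector on AV-Deepfake1M,
# then fine-tune it on LAV-DF. All hyperparameters here are assumptions.
import torch


def run_epochs(model, loader, lr, epochs):
    """One training phase: a standard supervised loop with BCE loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:  # labels: 0 = real, 1 = fake
            optimizer.zero_grad()
            logits = model(clips).squeeze(-1)
            loss = loss_fn(logits, labels.float())
            loss.backward()
            optimizer.step()


def pretrain_then_finetune(model, av1m_loader, lavdf_loader):
    # Phase 1: pre-train on the large AV-Deepfake1M training split.
    run_epochs(model, av1m_loader, lr=1e-4, epochs=10)
    # Phase 2: fine-tune on LAV-DF with a smaller learning rate.
    run_epochs(model, lavdf_loader, lr=1e-5, epochs=3)
    return model
```

The reported gain comes simply from initializing the LAV-DF model with weights learned on the much larger AV-Deepfake1M training split.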

Approach

The authors created AV-Deepfake1M using a three-stage pipeline: LLM-driven transcript manipulation with ChatGPT, high-quality audio generation with VITS and YourTTS, and lip-synchronized video generation with TalkLip. The pipeline produces realistic deepfakes containing word-level insertions, deletions, and replacements (see the sketch after this paragraph).
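
To make the three stages concrete, here is a minimal Python sketch of the pipeline. Only the word-level transcript editing is implemented; synthesize_speech and lip_sync are hypothetical placeholders standing in for the VITS/YourTTS and TalkLip models, and the Edit structure is an illustrative assumption, not the paper's actual interface.

```python
# Minimal sketch of the three-stage generation pipeline. Only stage 1's
# transcript bookkeeping is concrete; stages 2 and 3 are placeholders
# for the models named in the paper.
from dataclasses import dataclass, field


@dataclass
class Edit:
    op: str                                   # "insert" | "delete" | "replace"
    index: int                                # word position in the transcript
    words: list[str] = field(default_factory=list)  # new words ([] for delete)


def apply_edit(transcript: list[str], edit: Edit) -> list[str]:
    """Stage 1: apply one content-driven edit to a word-level transcript.
    In the paper, the edits themselves are proposed by an LLM (ChatGPT)."""
    if edit.op == "insert":
        return transcript[:edit.index] + edit.words + transcript[edit.index:]
    if edit.op == "delete":
        return transcript[:edit.index] + transcript[edit.index + 1:]
    if edit.op == "replace":
        return transcript[:edit.index] + edit.words + transcript[edit.index + 1:]
    raise ValueError(f"unknown op: {edit.op!r}")


def synthesize_speech(text: str) -> bytes:
    """Stage 2 placeholder: zero-shot TTS (VITS / YourTTS in the paper)."""
    raise NotImplementedError


def lip_sync(video_path: str, audio: bytes) -> str:
    """Stage 3 placeholder: lip-synchronized video generation (TalkLip)."""
    raise NotImplementedError


def generate_fake(video_path: str, transcript: list[str], edit: Edit):
    """Run the three stages to embed one manipulated segment in a real video."""
    fake_transcript = apply_edit(transcript, edit)
    # Audio is re-synthesized only when new words are introduced; a deletion
    # would splice the surrounding real audio instead (simplified away here).
    fake_audio = synthesize_speech(" ".join(edit.words)) if edit.words else b""
    fake_video_path = lip_sync(video_path, fake_audio)
    return fake_transcript, fake_audio, fake_video_path
```

Because each edit touches only a few words, the manipulated segments are short and embedded in otherwise real video, which is why temporal localization, not just video-level classification, is the core task the dataset targets.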

Datasets

VoxCeleb2 (for real videos), plus various existing deepfake datasets for comparison (DFDC, DeeperForensics, FakeAVCeleb, LAV-DF, DF-Platter, etc.)

Model(s)

Meso4, MesoInception4, Xception, EfficientViT, BA-TFD, BA-TFD+, UMMAFormer, Pyannote, TriDet, ActionFormer, MDS, MARLIN, Video-LLaMA, CLAP, M2TR, LipForensics, SBI, InternVideo, VideoMAEv2, BYOL-A

Author countries

Australia, India, United States