AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations

Authors: Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, Abhinav Dhall

Published: 2025-07-28 07:27:42+00:00

AI Summary

This paper introduces AV-Deepfake1M++, a large-scale audio-visual deepfake benchmark comprising 2 million video clips with diversified manipulation strategies and extensive audio-visual perturbations. It details the data generation strategies, which include various state-of-the-art deepfake models and real-world perturbations, and provides a benchmark using existing detection methods. The authors aim for this dataset to facilitate research in the deepfake domain and are hosting a detection challenge based on it.

Abstract

The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies commonly found in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M comprising 2 million video clips with diversified manipulation strategies and audio-visual perturbations. This paper describes the data generation strategies and benchmarks AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset, and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.


Key findings
The AV-Deepfake1M++ dataset presents a significant challenge to existing deepfake detection methods. While the top team achieved a 0.9783 AUC for video-level classification, baseline models like Xception performed poorly (0.5509 AUC). Temporal localization proved particularly difficult, with baseline models such as BA-TFD+ suffering a dramatic performance collapse. This indicates that the new perturbations and synthesis pipelines invalidate prior method designs, leaving a substantial gap for future research.
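
The video-level results above are reported as AUC. Below is a minimal sketch of how such a score can be computed with scikit-learn; the labels and scores are illustrative placeholders, not actual predictions or ground truth from the AV-Deepfake1M++ benchmark.

```python
# Minimal sketch of computing a video-level AUC score with scikit-learn.
# Labels and scores are illustrative placeholders, not benchmark outputs.
from sklearn.metrics import roc_auc_score

# 1 = fake video, 0 = real video; scores are per-video fake probabilities.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.92, 0.13, 0.78, 0.66, 0.41, 0.09, 0.88, 0.27]

auc = roc_auc_score(labels, scores)
print(f"Video-level AUC: {auc:.4f}")
```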
Approach
The authors created AV-Deepfake1M++ by sourcing unmanipulated real videos from VoxCeleb2, LRS3, and EngageNet. Deepfakes were generated using nine state-of-the-art visual (e.g., LatentSync, Diff2Lip, TalkLip) and audio (e.g., VITS, YourTTS, F5TTS, XTTSv2) models, applying diverse manipulation strategies like insert, replace, and delete, guided by an LLM for semantic editing. Crucially, the dataset incorporates 15 video-level and 11 audio-level real-world perturbations (e.g., compression, noise, stutter) to mimic online video conditions.
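
As a rough illustration of the kind of real-world perturbations described above (e.g., video compression and audio noise), the sketch below shows two simple operations. The file names, noise levels, and functions are assumptions for illustration only and do not reproduce the authors' actual perturbation pipeline.

```python
# Illustrative sketch of two real-world perturbations: heavy H.264
# compression via the ffmpeg CLI, and additive Gaussian noise on audio.
# Parameters (CRF value, target SNR) are arbitrary examples.
import subprocess
import numpy as np


def compress_video(src: str, dst: str, crf: int = 35) -> None:
    """Re-encode a clip with H.264 at a high CRF to simulate heavy compression."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst],
        check=True,
    )


def add_gaussian_noise(audio: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Add white Gaussian noise to a mono waveform at a target SNR (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```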
Datasets
AV-Deepfake1M++ (created and benchmarked), with original real videos sourced from VoxCeleb2, LRS3, and EngageNet.
Model(s)
Xception (for video-level classification), BA-TFD+, BA-TFD (for temporal localization). These models were used as baselines for benchmarking deepfake detection performance on the proposed dataset.
Author countries
Australia, United Arab Emirates, India