AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations

Authors: Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, Abhinav Dhall

Published: 2025-07-28 07:27:42+00:00

AI Summary

AV-Deepfake1M++ is a large-scale (2 million video clips) audio-visual deepfake benchmark dataset with diverse manipulation strategies and real-world perturbations. It aims to advance deepfake detection research by providing a more realistic and challenging evaluation setting.

Abstract

The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies commonly found in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M comprising 2 million video clips with diversified manipulation strategies and audio-visual perturbations. This paper describes the data generation strategies and benchmarks AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset, and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.


Key findings

The 2025 1M-Deepfakes Detection Challenge, built on AV-Deepfake1M++, revealed significant performance gaps in deepfake detection, particularly for temporal localization. Methods that perform well on earlier datasets struggled with the new perturbations and generation pipelines, underscoring the dataset's difficulty and its contribution to the field.
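To make the temporal localization task concrete, here is a minimal sketch of segment-level scoring based on temporal IoU between predicted and ground-truth fake segments. The metric shown (recall at a fixed IoU threshold) and all function names are illustrative assumptions, not the official challenge evaluation, which is provided in the scripts at the URL above.

```python
def temporal_iou(pred, gt):
    """IoU between two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thr=0.5):
    """Fraction of ground-truth fake segments matched by any prediction at IoU >= thr."""
    hits = sum(any(temporal_iou(p, g) >= thr for p in preds) for g in gts)
    return hits / len(gts) if gts else 1.0

if __name__ == "__main__":
    preds = [(1.2, 2.0), (5.0, 5.6)]          # predicted fake segments (s)
    gts = [(1.0, 1.9), (5.1, 5.5), (8.0, 8.4)]  # ground-truth fake segments (s)
    print(recall_at_iou(preds, gts, thr=0.5))   # 2 of 3 segments recovered
```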
Approach

The authors created AV-Deepfake1M++ by extending AV-Deepfake1M. They incorporated diverse deepfake generation methods (nine state-of-the-art models for audio and visual manipulation) and numerous audio-visual perturbations to simulate real-world scenarios. The dataset is split into training, validation, and two test sets.
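As an illustration only, the sketch below applies two common real-world-style perturbations: additive noise on an audio waveform and blur on a video frame. The perturbation types, parameters, and function names are assumptions for illustration and do not reproduce the authors' perturbation pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb_audio(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a target SNR (dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def perturb_frame(frame: np.ndarray, blur_sigma: float = 1.5) -> np.ndarray:
    """Blur an HxWx3 uint8 frame to mimic low-quality re-encoding."""
    blurred = gaussian_filter(frame.astype(np.float32),
                              sigma=(blur_sigma, blur_sigma, 0))
    return np.clip(blurred, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    audio = np.random.randn(16000)                                  # 1 s of dummy audio at 16 kHz
    frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # dummy RGB frame
    noisy_audio = perturb_audio(audio, snr_db=15.0)
    blurry_frame = perturb_frame(frame, blur_sigma=2.0)
    print(noisy_audio.shape, blurry_frame.shape)
```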
Datasets

VoxCeleb2, LRS3, EngageNet
Model(s)

None proposed (the paper benchmarks existing state-of-the-art detection methods on the dataset rather than introducing a new model)
Author countries

Australia, United Arab Emirates, India