1M-Deepfakes Detection Challenge

Authors: Zhixi Cai, Abhinav Dhall, Shreya Ghosh, Munawar Hayat, Dimitrios Kollias, Kalin Stefanov, Usman Tariq

Published: 2024-09-11 03:43:53+00:00

AI Summary

The paper introduces the 1M-Deepfakes Detection Challenge, leveraging the AV-Deepfake1M dataset (over 1 million manipulated videos) to advance research in deepfake detection and localization. The challenge covers both binary deepfake classification and temporal localization of the manipulated segments within videos.

Abstract

The detection and localization of deepfake content, particularly when small fake segments are seamlessly mixed with real videos, remains a significant challenge in the field of digital media security. Based on the recently released AV-Deepfake1M dataset, which contains more than 1 million manipulated videos across more than 2,000 subjects, we introduce the 1M-Deepfakes Detection Challenge. This challenge is designed to engage the research community in developing advanced methods for detecting and localizing deepfake manipulations within this large-scale, highly realistic audio-visual dataset. Participants can access the AV-Deepfake1M dataset and are required to submit their inference results for evaluation against the metrics for the detection or localization tasks. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection and localization systems. Evaluation scripts, baseline models, and accompanying code will be available at https://github.com/ControlNet/AV-Deepfake1M.


Key findings
The challenge attracted significant participation: 191 teams and 1,034 submissions. Submissions were unevenly distributed across the two tasks, with far more entries targeting detection than localization. Top-performing teams employed a variety of audio-visual fusion techniques and advanced architectures to achieve strong results on both the detection and localization tasks; a minimal fusion sketch is given below.
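The winning teams' fusion designs are not reproduced here. As a point of reference only, the sketch below shows the simplest form of audio-visual fusion, late fusion by weighted score averaging. The classifiers producing `audio_scores` and `visual_scores` are hypothetical placeholders; actual entries used far richer interaction modules (e.g., cross-modal attention).

```python
import numpy as np

# Minimal late-fusion sketch: weighted average of per-modality fake
# probabilities. The per-modality classifiers are hypothetical; real
# challenge systems used richer audio-visual interaction modules.

def late_fusion(audio_scores: np.ndarray, visual_scores: np.ndarray,
                audio_weight: float = 0.5) -> np.ndarray:
    """Weighted average of per-video P(fake) from two modalities."""
    return audio_weight * audio_scores + (1.0 - audio_weight) * visual_scores

audio_scores = np.array([0.9, 0.2, 0.6])   # per-video P(fake), audio branch
visual_scores = np.array([0.7, 0.1, 0.8])  # per-video P(fake), visual branch
print(late_fusion(audio_scores, visual_scores))  # [0.8  0.15 0.7 ]
```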
Approach
The challenge uses the AV-Deepfake1M dataset, which contains a large number of videos with both audio and visual manipulations. Participants develop models to perform two tasks: deepfake detection (binary classification of whole videos) and temporal localization (identifying the start and end times of manipulated segments).
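Detection reduces to scoring each video with a fake probability, while localization is scored at the segment level with temporal-overlap metrics such as AP/AR computed over temporal IoU thresholds (the official protocol lives in the evaluation scripts linked above). Below is a minimal sketch of the underlying quantity, temporal IoU with greedy segment matching; the segment format, threshold, and matching rule are illustrative assumptions, not the challenge's evaluation script.

```python
# Minimal sketch of temporal IoU scoring for deepfake localization.
# Segment format (start_sec, end_sec[, confidence]), the 0.5 threshold,
# and greedy matching are illustrative assumptions.

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(preds, gts, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth segments."""
    matched_gt = set()
    tp = 0
    for pred in sorted(preds, key=lambda p: p[2], reverse=True):  # by confidence
        for i, gt in enumerate(gts):
            if i not in matched_gt and temporal_iou(pred[:2], gt) >= iou_threshold:
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall

# Example: one fake segment predicted slightly off its true boundaries.
preds = [(1.2, 2.1, 0.93)]          # (start_sec, end_sec, confidence)
gts = [(1.0, 2.0)]                  # ground-truth manipulated segment
print(match_segments(preds, gts))   # IoU = 0.8/1.1 ~ 0.73, matched at 0.5
```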
Datasets
AV-Deepfake1M dataset
Model(s)
Various models were used by challenge participants, including but not limited to: Audio-Visual Local-Global Interaction Module (AV-LG Module), Vision Transformers, Wav2Vec-XLS-R, gMLP, UMMAFormer, BYOL-A, TSN, InternVideo.
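Participants' exact configurations are not described here. As one hedged illustration of a listed component, the sketch below extracts frame-level audio features with Wav2Vec-XLS-R via the Hugging Face transformers library; the checkpoint name facebook/wav2vec2-xls-r-300m and the 16 kHz mono input are assumptions for the example, not details taken from the submissions.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

# Minimal sketch: frame-level audio embeddings from Wav2Vec-XLS-R.
# Checkpoint and 16 kHz mono input are assumptions for illustration.
extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
model.eval()

waveform = torch.randn(16000)  # 1 second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, ~49 frames, 1024)
print(features.shape)
```

Such embeddings can feed a downstream detection or localization head alongside visual features.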
Author countries
Australia, United States, United Kingdom, United Arab Emirates