Inclusion 2024 Global Multimedia Deepfake Detection Challenge: Towards Multi-dimensional Face Forgery Detection

Authors: Yi Zhang, Weize Gao, Changtao Miao, Man Luo, Jianshu Li, Wenzhong Deng, Zhe Li, Bingyu Hu, Weibin Yao, Yunfeng Diao, Wenbo Zhou, Tao Gong, Qi Chu

Published: 2024-12-30 09:58:27+00:00

AI Summary

This paper presents the Inclusion 2024 Global Multimedia Deepfake Detection Challenge, which targets manipulated images and audio-video content. The challenge attracted 1,500 teams, and the paper analyzes the top 3 solutions from each of its two tracks (image and audio-video deepfake detection).

Abstract

In this paper, we present the Global Multimedia Deepfake Detection Challenge, held concurrently with Inclusion 2024. Our Multimedia Deepfake Detection Challenge aims to detect automatic image and audio-video manipulations, including but not limited to editing, synthesis, generation, Photoshopping, etc. Our challenge attracted 1,500 teams from all over the world, with about 5,000 valid submissions. We invited the top 20 teams to present their solutions to the challenge, from which the top 3 teams were awarded prizes in the grand finale. In this paper, we present the solutions from the top 3 teams of the two tracks, to advance research in image and audio-video forgery detection. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection systems, and we encourage participants to open source their methods.


Key findings
While the top teams achieved high AUC scores, significant performance disparities remain, especially at low false positive rates: the true positive rate in that regime is still suboptimal, highlighting the need for further research on generalization and robustness in deepfake detection.
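
The TPR-at-low-FPR metric discussed above can be read directly off the ROC curve. Below is a minimal sketch using scikit-learn; the 1% FPR budget is an illustrative choice, not a threshold specified by the challenge:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """Return the true positive rate at the operating point whose
    false positive rate does not exceed target_fpr."""
    fpr, tpr, _ = roc_curve(labels, scores)
    # roc_curve returns fpr in ascending order; take the last point
    # that still satisfies the FPR budget.
    valid = np.where(fpr <= target_fpr)[0]
    return tpr[valid[-1]] if len(valid) else 0.0

# Toy example: 6 real (label 0) and 6 fake (label 1) samples.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
s = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.60,
              0.90, 0.80, 0.40, 0.95, 0.70, 0.55])
print("AUC:", roc_auc_score(y, s))
print("TPR@1%FPR:", tpr_at_fpr(y, s))
```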
Approach
The challenge used a two-track approach: Track 1 focused on image deepfake detection, and Track 2 on audio-video deepfake detection. Winning solutions employed various techniques including data augmentation, model ensembling, and multimodal feature fusion to improve generalization and detection accuracy.
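
Score-level model ensembling, one of the techniques mentioned above, typically averages the calibrated fake probabilities produced by several backbones. A minimal sketch follows; the backbone names and weights are illustrative assumptions, not details taken from any team's actual solution:

```python
import numpy as np

def ensemble_scores(per_model_scores, weights=None):
    """Weighted average of per-model fake probabilities.

    per_model_scores: shape (n_models, n_samples), each row holding the
    sigmoid outputs of one backbone (e.g. EfficientNet, ConvNeXt, Swin).
    """
    scores = np.asarray(per_model_scores, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.average(scores, axis=0, weights=weights)

# Three hypothetical backbones scoring four samples.
scores = [
    [0.91, 0.12, 0.55, 0.80],  # e.g. EfficientNet
    [0.88, 0.20, 0.40, 0.75],  # e.g. ConvNeXt
    [0.95, 0.05, 0.60, 0.85],  # e.g. Swin Transformer
]
print(ensemble_scores(scores))                      # uniform average
print(ensemble_scores(scores, [0.5, 0.25, 0.25]))   # weighted average
```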
Datasets
MultiFF dataset (MultiFFI for images, MultiFFV for audio-video); other datasets mentioned for comparison include FaceForensics++, Celeb-DF, DiFF, LAV-DF, AV-Deepfake1M, VoxCeleb, CelebV-HQ, VFHQ, VCTK, TalkingHead, and LJSpeech.
Model(s)
Various models were used by the competing teams, including EfficientNet, ConvNeXt, MobileNet, Xception, MobileOne, Swin Transformer, SyncNet, VideoMAE-Base, and Support Vector Machines (SVMs). Specific architectures varied across teams; a setup sketch for one of the image backbones follows.
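
As an illustration of how one of the listed image backbones could be set up as a binary forgery classifier, here is a minimal timm sketch; the specific EfficientNet variant and the single-logit head are assumptions, not details from the winning entries:

```python
import timm
import torch

# Single-logit head: sigmoid(output) is the probability the face is fake.
model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=1)
model.eval()

# Dummy 380x380 RGB face crop (EfficientNet-B4's usual input resolution).
face = torch.randn(1, 3, 380, 380)
with torch.no_grad():
    fake_prob = torch.sigmoid(model(face)).item()
print(f"P(fake) = {fake_prob:.3f}")
```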
Author countries
China, Singapore