Data-Driven Deepfake Image Detection Method -- The 2024 Global Deepfake Image Detection Challenge

Authors: Xiaoya Zhu, Yibing Nan, Shiguo Lian

Published: 2025-08-15 13:24:47+00:00

AI Summary

This paper presents a deepfake image detection method built on a Swin Transformer V2-B classification network. The approach relies heavily on online data augmentation and offline sample generation to enrich training data diversity and improve model generalization. The method received an award of excellence in the 2024 Global Deepfake Image Detection Challenge.

Abstract

With the rapid development of AI technology, deepfake technology has emerged as a double-edged sword: it has created a large amount of AI-generated content while posing unprecedented challenges to digital security. The competition task is to determine whether a face image is a deepfake and to output its probability of being one. In the image track, our approach is based on the Swin Transformer V2-B classification network, with online data augmentation and offline sample generation methods employed to enrich the diversity of training samples and increase the generalization ability of the model. Our method received an award of excellence in deepfake image detection.


Key findings
The proposed data-driven approach, leveraging Swin Transformer V2-B and diverse data augmentation strategies, significantly improved model generalization and robustness against various deepfake attacks. The method achieved a high score of 0.96916 in the competition, leading to an award of excellence in deepfake image detection.
Approach
The method uses a Swin Transformer V2-B as the core classification network. It incorporates comprehensive data preprocessing, including online augmentations like random horizontal flips and AutoAugment, and offline generation of diverse negative samples through techniques such as random facial region cutout, local cropping, cartoonization, and sketching. Post-processing involving Dlib and OpenCV face detectors is applied to refine classification confidence for ambiguous predictions.
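The online augmentation and post-processing steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the flip probability, the ambiguity band, and the direction and size of the confidence adjustment are all assumptions, and a NumPy array stands in for a decoded face image:

```python
import numpy as np

def random_horizontal_flip(image: np.ndarray, p: float = 0.5, rng=None) -> np.ndarray:
    """Online augmentation: flip an H x W x C image left-right with probability p."""
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        return image[:, ::-1, :].copy()
    return image

def refine_score(raw_score: float, dlib_found_face: bool, cv_found_face: bool,
                 low: float = 0.4, high: float = 0.6) -> float:
    """Post-processing sketch: only ambiguous scores (inside [low, high]) are
    adjusted, using agreement between two face detectors (Dlib and OpenCV in
    the paper) as a weak signal. The band and the adjustment are hypothetical."""
    if not (low <= raw_score <= high):
        return raw_score  # confident predictions are left untouched
    if not dlib_found_face and not cv_found_face:
        # neither detector finds a face -> nudge the score toward "deepfake"
        return min(1.0, raw_score + 0.2)
    return raw_score
```

In practice the flip (together with AutoAugment) would sit inside the training data loader, while `refine_score` would run once per image at inference time after the classifier produces its raw probability.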
Datasets
MultiFF dataset (524K images), supplemented by synthetically generated datasets: Random facial region cutout (40K), Localized cropping (10K), Random grayscaling, translating and overlaying (10K), Cartoonization (10K), Sketching (5K), and Binarization (5K).
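Two of the offline negative-sample generators listed above, random region cutout and binarization, might look roughly like this. The cutout size range and the binarization threshold are illustrative assumptions rather than values from the paper, and a plain NumPy array stands in for a decoded face image:

```python
import numpy as np

def random_region_cutout(image: np.ndarray, min_frac: float = 0.1,
                         max_frac: float = 0.3, rng=None) -> np.ndarray:
    """Zero out a random rectangular region of an H x W x C image,
    producing a 'partially erased face' negative sample."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ch = max(1, int(h * rng.uniform(min_frac, max_frac)))
    cw = max(1, int(w * rng.uniform(min_frac, max_frac)))
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = image.copy()
    out[y:y + ch, x:x + cw] = 0
    return out

def binarize(image: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Average channels to grayscale, then threshold to pure black/white."""
    gray = image.mean(axis=2)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)
```

Samples produced this way no longer look like authentic photographs, so labeling them as negatives gives the classifier explicit examples of non-photorealistic face renderings.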
Model(s)
Swin Transformer V2-B (pre-trained on ImageNet1k) for image deepfake detection.
Author countries
China