Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives

Authors: Matheus Martins Batista

Published: 2025-04-03 02:10:27+00:00

AI Summary

This research compares deepfake detection methods, focusing on the GenConViT model's performance against other architectures in the DeepfakeBenchmark. GenConViT, after fine-tuning, demonstrated superior accuracy (93.82%) and generalization on the DeepSpeak dataset.

Abstract

The growing threat posed by deepfake videos, which can distort reality and spread misinformation, creates an urgent need for effective detection methods. This work investigates and compares different approaches for identifying deepfakes, focusing on the GenConViT model and its performance relative to other architectures in the DeepfakeBenchmark. To contextualize the research, it addresses the social and legal impacts of deepfakes, as well as the technical fundamentals of their creation and detection: digital image processing, machine learning, and artificial neural networks, with emphasis on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. The models were evaluated with relevant metrics on recent datasets from the literature, such as WildDeepfake and DeepSpeak, with the aim of identifying the most effective tools against misinformation and media manipulation. The results indicate that GenConViT, after fine-tuning, achieved the best accuracy (93.82%) and generalization capacity, outperforming the other DeepfakeBenchmark architectures on the DeepSpeak dataset. This study advances deepfake detection techniques and supports the development of more robust and effective solutions against the spread of false information.


Key findings
After fine-tuning, GenConViT achieved higher accuracy (93.82%) and AUC (0.993) on the DeepSpeak dataset than the other models in the DeepfakeBenchmark. The study also highlighted how fine-tuning and the choice of constituent network (autoencoder, AE, vs. variational autoencoder, VAE) affect the model's performance and its ability to generalize across datasets.
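For context on the AE vs. VAE distinction, the textbook training objectives of the two network types are sketched below. These are the standard formulations, not the exact losses used by GenConViT, whose weighting (and any added real/fake classification term) is not given in this summary. Here $x$ is an input face frame, $f_\phi$ the encoder, $g_\theta$ the decoder, and, for the VAE, the encoder outputs a mean $\mu$ and standard deviation $\sigma$ from which the latent $z$ is sampled.

```latex
% Autoencoder: deterministic latent code, reconstruction loss only
\mathcal{L}_{\mathrm{AE}} = \lVert x - g_\theta(f_\phi(x)) \rVert_2^2

% VAE: stochastic latent via reparameterization, reconstruction + KL regularizer
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\mathcal{L}_{\mathrm{VAE}} = \lVert x - g_\theta(z) \rVert_2^2
  - \frac{1}{2} \sum_{j} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

The second term of $\mathcal{L}_{\mathrm{VAE}}$ is the closed-form KL divergence between the diagonal Gaussian $\mathcal{N}(\mu, \sigma^2)$ and the prior $\mathcal{N}(0, I)$; it pushes the VAE toward a smoother latent space than the plain AE, which is one plausible reason the two variants can generalize differently.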
Approach
The research compares existing deepfake detection models, primarily focusing on the GenConViT model, a hybrid architecture combining convolutional neural networks (CNNs) and transformers. The models were evaluated using standard metrics on the WildDeepfake and DeepSpeak datasets, with fine-tuning performed on the DeepSpeak dataset for a comparative analysis.
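The two headline metrics, accuracy and AUC, can be computed from per-sample fake probabilities as in the minimal sketch below. It assumes labels encoded as 1 = fake and 0 = real and a 0.5 decision threshold for accuracy; neither convention is stated in the summary, and the function names are hypothetical, not taken from the DeepfakeBenchmark or GenConViT code.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded fake score matches the label (1 = fake, 0 = real)."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and of equal length")
    # A score at or above the (assumed) threshold is predicted as fake.
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen fake scores higher than a randomly chosen real, ties counting 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one fake and one real sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy scores, not results from the study.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
    print(f"accuracy = {accuracy(labels, scores):.3f}")  # 4 of 6 correct
    print(f"AUC      = {roc_auc(labels, scores):.3f}")   # 8 of 9 fake/real pairs ranked correctly
```

The example also shows why both metrics are reported: accuracy depends on the chosen threshold, while AUC measures ranking quality over all thresholds. The pairwise loop is quadratic and meant for clarity; scikit-learn's `roc_auc_score` computes the same quantity efficiently by sorting.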
Datasets
WildDeepfake, DeepSpeak, Celeb-DF (v2), DFDC, FaceForensics++, DeepfakeTIMIT
Model(s)
GenConViT (including its constituent networks: AE and VAE with ConvNeXt and Swin Transformer backbones), XceptionNet, EfficientNetB4, Meso4Inception, Spatial-Phase Shallow Learning (SPSL), Uncovering Common Features (UCF)
Author countries
Brazil