Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles

Authors: Aagam Bakliwal, Amit D. Joshi

Published: 2023-10-25 06:00:37+00:00

AI Summary

This paper proposes a novel deepfake detection approach using an ensemble of 2D and 3D Convolutional Neural Networks. The 3D CNN captures spatiotemporal features, while the 2D CNN (EfficientNet) focuses on spatial features; these are combined using Voting Ensembles and Adaptive Weighted Ensembling, prioritizing the 3D model's output.

Abstract

In the dynamic realm of deepfake detection, this work presents an innovative approach to validate video content. The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is uniquely tailored to capture spatiotemporal features via sliding filters, extending through both spatial and temporal dimensions. This configuration enables nuanced pattern recognition in pixel arrangement and temporal evolution across frames. Simultaneously, the 2D model leverages EfficientNet architecture, harnessing auto-scaling in Convolutional Neural Networks. Notably, this ensemble integrates Voting Ensembles and Adaptive Weighted Ensembling. Strategic prioritization of the 3-dimensional model's output capitalizes on its exceptional spatio-temporal feature extraction. Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation's deceptive practices.
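
The "sliding filter" idea in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single-channel clip stored as nested lists indexed `[frame][row][column]`, stride 1, and no padding. The function name `conv3d_valid` is hypothetical. Like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
def conv3d_valid(clip, kernel):
    """Slide a 3D kernel over a clip along time, height, and width.

    clip   : nested list indexed [frame][row][column]
    kernel : nested list indexed [dt][di][dj]
    Returns a nested list of the same layout with 'valid' output size.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # temporal position
        plane = []
        for i in range(H - kh + 1):      # vertical position
            row = []
            for j in range(W - kw + 1):  # horizontal position
                acc = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += clip[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames whose brightness changes over time (illustrative values).
clip = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[3, 3], [3, 3]],
]
# A 2x1x1 kernel that subtracts each frame from the next one.
temporal_diff = [[[-1]], [[1]]]
motion = conv3d_valid(clip, temporal_diff)
```

Because the kernel spans two frames, each output value depends on how a pixel changes over time, which a 2D filter applied to a single frame cannot see.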

Key findings
The ensemble is reported to outperform existing benchmarks in deepfake detection, achieving higher AUC and lower LogLoss. Prioritizing the 3D model's output improves performance because that model captures spatiotemporal features effectively. The Attention2D model also improves on the baseline EfficientNetB4.
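
For readers unfamiliar with the two metrics, here is a minimal sketch of how LogLoss and AUC are commonly computed for binary labels (1 = fake, 0 = real). The function names and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random fake scores above a random real video.

    Ties count as half a win. Higher is better; 0.5 is chance level.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels and predicted probabilities of "fake".
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
loss = log_loss(labels, scores)
```

AUC depends only on how the scores rank the videos, while LogLoss also penalizes confident wrong probabilities, so the two can move independently.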
Approach
The approach uses an ensemble of 3D CNNs (I3D, 3D ResNet34, MC3, and R(2+1)D) and a 2D CNN (EfficientNetB4 with an attention mechanism). The models' predictions are combined using Voting Ensembles and Adaptive Weighted Ensembling, with a higher weight given to the 3D model's output because of its stronger spatiotemporal feature extraction.
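
The two combination strategies can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: the weights here are fixed by hand (the summary does not say how the adaptive weights are derived), the 0.5 threshold and the tie-breaking rule are assumptions, and the function names are hypothetical.

```python
from typing import Sequence


def majority_vote(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Hard voting: each model votes 'fake' (1) if its probability >= threshold.

    Returns 1 only on a strict majority; ties resolve to 'real' (0),
    which is an assumption made for this sketch.
    """
    votes = sum(1 for p in probs if p >= threshold)
    return 1 if votes * 2 > len(probs) else 0


def weighted_average(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Soft ensembling: weighted mean of per-model 'fake' probabilities."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total


# Illustrative values: the 3D model says 0.8, the 2D model says 0.3.
p_3d, p_2d = 0.8, 0.3
# Assumed weights that prioritize the 3D model's output.
combined = weighted_average([p_3d, p_2d], [0.7, 0.3])  # approximately 0.65
```

With these assumed weights the combined score stays above 0.5, so the 3D model's opinion prevails even though the 2D model disagrees.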
Datasets
DFDC dataset
Model(s)
EfficientNetB4, I3D, 3D ResNet34, MC3, R(2+1)D. Ensemble methods used: Voting Ensembles and Adaptive Weighted Ensembling.
Author countries
India