Explainable Deepfake Video Detection using Convolutional Neural Network and CapsuleNet

Authors: Gazi Hasin Ishrak, Zalish Mahmud, MD. Zami Al Zunaed Farabe, Tahera Khanom Tinni, Tanzim Reza, Mohammad Zavid Parvez

Published: 2024-04-19 12:21:27+00:00

AI Summary

This research proposes a hybrid deepfake video detection model combining Convolutional Neural Networks (CNNs), Capsule Networks, and Long Short-Term Memory (LSTM) networks. The model aims not only for accurate detection but also for explainability using Grad-CAM to visualize the model's decision-making process.

Abstract

Deepfake technology, derived from deep learning, seamlessly inserts individuals into digital media, irrespective of their actual participation. Its foundation lies in machine learning and Artificial Intelligence (AI). Initially, deepfakes served research, industry, and entertainment. While the concept has existed for decades, recent advancements render deepfakes nearly indistinguishable from reality. Accessibility has soared, empowering even novices to create convincing deepfakes. However, this accessibility raises security concerns. The primary deepfake creation algorithm, GAN (Generative Adversarial Network), employs machine learning to craft realistic images or videos. Our objective is to utilize CNN (Convolutional Neural Network) and CapsuleNet with LSTM to differentiate between deepfake-generated frames and originals. Furthermore, we aim to elucidate our model's decision-making process through Explainable AI, fostering transparent human-AI relationships and offering practical examples for real-life scenarios.


Key findings
The proposed hybrid model achieved 88% validation accuracy on the DFDC dataset. Grad-CAM visualization helped explain the model's decisions by highlighting relevant facial regions in both real and fake videos. The model showed improved performance compared to a referenced combined-model approach.
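The Grad-CAM step mentioned above can be sketched as follows. This is a minimal pure-Python illustration of the standard Grad-CAM computation (each channel is weighted by its spatially averaged gradient, and a ReLU is applied to the weighted sum of feature maps), not the authors' code. The function name grad_cam, the nested-list input layout, and the toy values are assumptions made for illustration; the final scaling to [0, 1] is a common visualization convenience rather than part of the core definition.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM style heatmap.

    activations, gradients: lists of K channels, each an H x W nested list.
    gradients holds d(class score)/d(activation) for the same layer.
    (Hypothetical input layout, chosen for this sketch.)
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * act[i][j]
    # ReLU keeps only regions that push the class score up.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    # Scale to [0, 1] for display (skipped if the map is all zeros).
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Toy example: two 2x2 feature maps with constant gradients of +1 and -1.
toy_activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

In a real pipeline the activations and gradients would come from a convolutional layer of the trained detector, and the heatmap would be upsampled and overlaid on the face crop.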
Approach
The authors address deepfake detection by using a hybrid model that leverages the strengths of CNNs for feature extraction, Capsule Networks to mitigate information loss from pooling layers, and LSTM for temporal analysis of video frames. Explainable AI techniques, specifically Grad-CAM, are employed to interpret the model's predictions.
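This summary does not describe the capsule layers in detail. As a hedged illustration of how capsules keep information that pooling would discard, the sketch below implements the squashing nonlinearity from the standard capsule network formulation: a capsule's output vector keeps its direction while its length is compressed into [0, 1) so that it can act as a presence probability. Whether the paper uses exactly this form is an assumption; the name squash and the eps parameter are illustrative.

```python
import math


def squash(s, eps=1e-9):
    """Squash a capsule vector s: keep its direction, map its length into [0, 1).

    v = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    eps is a small constant (an assumption of this sketch) that avoids
    division by zero for an all-zero input.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector, read as the capsule's activation."""
    return math.sqrt(sum(x * x for x in v))


long_capsule = squash([3.0, 4.0])    # input length 5, output length near 25/26
short_capsule = squash([0.1, 0.0])   # short inputs are shrunk toward zero
```

Long input vectors end up with length close to 1 and short ones close to 0, which is what allows the vector length to be read as a confidence value.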
Datasets
Deepfake Detection Challenge (DFDC) dataset
Model(s)
Hybrid model: CNN, Capsule Network, LSTM; pre-trained models: Xception, InceptionV3 (mentioned for comparison)
Author countries
Bangladesh