DepthFake: a depth-based strategy for detecting Deepfake videos

Authors: Luca Maiano, Lorenzo Papa, Ketbjano Vocaj, Irene Amerini

Published: 2022-08-23 16:38:25+00:00

AI Summary

This paper proposes DepthFake, a deepfake detection method that augments RGB frames with depth maps obtained via monocular depth estimation. By leveraging depth inconsistencies introduced by deepfake generation, DepthFake improves detection accuracy by 3.20% on average, and by up to 11.7% on some manipulation types, over RGB-only methods on FaceForensics++.

Abstract

Fake content has grown at an incredible rate over the past few years. The spread of social media and online platforms makes its large-scale dissemination increasingly accessible to malicious actors. In parallel, due to the growing diffusion of fake image generation methods, many Deep Learning-based detection techniques have been proposed. Most of these methods rely on extracting salient features from RGB images and using a binary classifier to decide whether an image is fake or real. In this paper, we propose DepthFake, a study on how to improve classical RGB-based approaches with depth maps. The depth information is extracted from RGB images with recent monocular depth estimation techniques. Here, we demonstrate the effective contribution of depth maps to the deepfake detection task on robust pre-trained architectures. The proposed RGBD approach is in fact able to achieve an average improvement of 3.20%, and up to 11.7% for some deepfake attacks, with respect to standard RGB architectures on the FaceForensics++ dataset.


Key findings

The integration of depth maps consistently improved deepfake detection accuracy across various deepfake generation methods. The proposed RGBD approach achieved an average improvement of 3.20% and up to 11.7% compared to standard RGB-only architectures. Preliminary results also suggest that using grayscale and depth information can achieve comparable or even better results than RGBD while minimizing the impact on inference time.
Approach

DepthFake first estimates depth maps from input RGB images using a pre-trained monocular depth estimation model. These depth maps are then concatenated with the RGB images, and a pre-trained convolutional neural network (Xception) is used to classify frames as real or fake.
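The RGBD input construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: `estimate_depth` is a hypothetical stand-in for the pre-trained monocular depth estimator, and the resulting 4-channel tensor would be fed to an Xception backbone whose first convolution is adapted to accept four input channels (not shown).

```python
import numpy as np

def estimate_depth(rgb: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained monocular depth
    estimation model. Returns a single-channel (H, W, 1) depth map
    in [0, 1]; a real pipeline would run a depth network here."""
    # Placeholder computation so the sketch is self-contained.
    gray = rgb.mean(axis=-1, keepdims=True)
    return gray / 255.0

def make_rgbd(rgb: np.ndarray) -> np.ndarray:
    """Concatenate the estimated depth map as a fourth channel,
    producing the RGBD input for the classification backbone.
    (The grayscale+depth variant mentioned in the key findings
    would instead stack a 1-channel grayscale image with depth.)"""
    depth = estimate_depth(rgb)
    return np.concatenate([rgb / 255.0, depth], axis=-1)

# 299x299 is Xception's default input resolution.
frame = np.random.randint(0, 256, size=(299, 299, 3)).astype(np.float32)
rgbd = make_rgbd(frame)
print(rgbd.shape)  # (299, 299, 4)
```

Concatenating along the channel axis keeps the backbone unchanged except for its first convolution, which is one reason pre-trained RGB architectures transfer well to this setup.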
Datasets

FaceForensics++
Model(s)

Xception, ResNet50, MobileNet-V1 (primarily Xception for the main results)
Author countries

Italy