Detecting Deepfakes with Metric Learning

Authors: Akash Kumar, Arnav Bhavsar

Published: 2020-03-19 09:44:23+00:00

AI Summary

This paper proposes a deep learning approach for deepfake detection that is particularly effective in high compression scenarios. It leverages metric learning with a triplet network architecture to enhance the feature space distance between real and fake video embedding vectors. The method demonstrates state-of-the-art performance on the Celeb-DF dataset and significantly improved accuracy on a highly compressed Neural Texture dataset.

Abstract

With the arrival of several face-swapping applications such as FaceApp, SnapChat, MixBooth, FaceBlender and many more, the authenticity of digital media content is hanging by a very loose thread. On social media platforms, videos are widely circulated, often at a high compression factor. In this work, we analyze several deep learning approaches to deepfake classification in high-compression scenarios and demonstrate that a proposed approach based on metric learning can be very effective at such classification. Using fewer frames per video to assess its realism, the metric learning approach using a triplet network architecture proves fruitful. It learns to increase the feature-space distance between the clusters of real and fake video embedding vectors. We validated our approaches on two datasets to analyze their behavior in different environments. We achieved a state-of-the-art AUC score of 99.2% on the Celeb-DF dataset and an accuracy of 90.71% on a highly compressed Neural Texture dataset. Our approach is especially helpful on social media platforms where data compression is inevitable.


Key findings
The proposed metric learning approach achieved a state-of-the-art AUC score of 99.2% on the Celeb-DF dataset. It significantly improved deepfake detection on highly compressed videos, achieving 90.71% accuracy on the Neural Texture dataset and an AUC of 92.9% on FF++ (c40), outperforming other deep learning methods in low-resolution conditions with fewer frames.
Approach
The approach extracts faces from video frames using MTCNN. It then applies a metric learning strategy with a triplet network architecture, using FaceNet to generate 512-dimensional face embeddings. Semi-hard triplets are mined online to train the network so that the clusters of real and fake video embeddings are clearly separated in the feature space.
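To make the semi-hard mining step concrete, here is a minimal NumPy sketch of a triplet loss computed over one batch of embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the margin of 0.2 and the use of squared Euclidean distance are assumptions (neither is given in this summary), and labels are taken to be 0 for real and 1 for fake. A triplet (anchor, positive, negative) counts as semi-hard when the negative is farther from the anchor than the positive, but by less than the margin.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length (a common convention for face embeddings)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def pairwise_sq_dists(emb):
    """Squared Euclidean distance between every pair of rows of emb."""
    sq = np.sum(emb ** 2, axis=1)
    d = sq[:, None] - 2.0 * (emb @ emb.T) + sq[None, :]
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def semi_hard_triplet_loss(emb, labels, margin=0.2):
    """Mean triplet loss over semi-hard triplets found in the batch.

    For each (anchor, positive) pair with the same label, pick the closest
    negative n that satisfies d(a, p) < d(a, n) < d(a, p) + margin.
    Returns 0.0 when the batch contains no semi-hard triplet.
    margin=0.2 is an assumed value, not one taken from the paper.
    """
    labels = np.asarray(labels)
    d = pairwise_sq_dists(np.asarray(emb, dtype=float))
    same = labels[:, None] == labels[None, :]
    losses = []
    for a in range(len(labels)):
        for p in range(len(labels)):
            if a == p or not same[a, p]:
                continue
            semi_hard = (~same[a]) & (d[a] > d[a, p]) & (d[a] < d[a, p] + margin)
            if semi_hard.any():
                d_an = d[a][semi_hard].min()
                losses.append(d[a, p] - d_an + margin)
    return float(np.mean(losses)) if losses else 0.0

# Usage with random stand-ins for 512-dimensional face embeddings.
rng = np.random.default_rng(0)
batch = l2_normalize(rng.normal(size=(8, 512)))
batch_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = real, 1 = fake (assumed encoding)
loss = semi_hard_triplet_loss(batch, batch_labels)
```

In training, this loss would be minimized with respect to the embedding network's weights, so a framework with automatic differentiation would replace the plain NumPy loops. The sketch only shows which triplets are selected and how each contributes to the loss.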
Datasets
Celeb-DF, FaceForensics++ (FF++) at c40 compression, including the Neural Texture forgery type
Model(s)
MTCNN, XceptionNet, FaceNet (for embeddings), Triplet Network, Random Forest (RF), Stochastic Gradient Descent (SGD)
Author countries
India