FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection

Authors: Gil Knafo, Ohad Fried

Published: 2022-12-01 18:56:31+00:00

AI Summary

FakeOut is a novel multi-modal video deepfake detection approach that leverages a self-supervised, out-of-domain backbone trained on non-deepfake videos and adapted to the deepfake domain. This approach achieves state-of-the-art results in cross-dataset generalization, demonstrating the surprising effectiveness of out-of-domain training for robust deepfake detection.

Abstract

Video synthesis methods have improved rapidly in recent years, allowing easy creation of synthetic humans. This poses a problem, especially in the era of social media, as synthetic videos of speaking humans can be used to spread misinformation in a convincing manner. Thus, there is a pressing need for accurate and robust deepfake detection methods that can detect forgery techniques not seen during training. In this work, we explore whether this can be done by leveraging a multi-modal, out-of-domain backbone trained in a self-supervised manner and adapted to the video deepfake domain. We propose FakeOut, a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase. We demonstrate the efficacy and robustness of FakeOut in detecting various types of deepfakes, especially manipulations not seen during training. Our method achieves state-of-the-art results in cross-dataset generalization on audio-visual datasets. This study shows that, perhaps surprisingly, training on out-of-domain videos (i.e., videos not specifically featuring speaking humans) can lead to better deepfake detection systems. Code is available on GitHub.


Key findings
FakeOut achieves state-of-the-art results on cross-dataset generalization tasks, particularly in detecting manipulation techniques not seen during training. Out-of-domain, self-supervised pre-training significantly improves robustness, surpassing methods that rely solely on in-domain, supervised training. Including the audio modality further enhances performance.
Approach
FakeOut uses a multi-modal backbone pre-trained in a self-supervised manner on out-of-domain videos (HowTo100M and AudioSet). This backbone is then fine-tuned on a deepfake dataset (FaceForensics++) in a supervised manner, incorporating both the audio and video modalities for improved robustness and generalization.
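
This describes a two-stream design: separate video and audio encoders whose per-clip embeddings feed a shared classification head. Below is a minimal PyTorch sketch of such a fusion head; the embedding sizes, MLP depth, and single-logit output are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AudioVisualFusionHead(nn.Module):
    """Hypothetical fusion head: concatenates per-clip video and audio
    embeddings and classifies the clip as real or fake."""

    def __init__(self, video_dim: int = 2048, audio_dim: int = 2048,
                 hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: fake vs. real
        )

    def forward(self, video_emb: torch.Tensor,
                audio_emb: torch.Tensor) -> torch.Tensor:
        # video_emb: (batch, video_dim), audio_emb: (batch, audio_dim)
        return self.mlp(torch.cat([video_emb, audio_emb], dim=-1))
```

During adaptation, the pre-trained backbones would be fine-tuned jointly with a head like this under a binary cross-entropy objective (e.g., nn.BCEWithLogitsLoss).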
Datasets
HowTo100M, AudioSet (pre-training); FaceForensics++, DeepFake Detection Challenge (DFDC), DeeperForensics, FaceShifter, DeepfakeTIMIT, FakeAVCeleb (fine-tuning and testing)
Model(s)
Temporal Shift Module (TSM) with a ResNet50 backbone (or a variant with doubled channels) for video; a ResNet50 for audio. A multi-layer perceptron serves as the classification head.
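
The Temporal Shift Module itself is a parameter-free operation that moves a fraction of the channels one step forward or backward along the time axis, letting a 2D ResNet exchange information between neighboring frames. A short sketch follows; the shift fraction of 1/8 per direction follows the original TSM paper and is an assumption here, not necessarily FakeOut's setting.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_fraction: float = 0.125) -> torch.Tensor:
    """Shift a fraction of channels along the time axis.

    x: video features of shape (batch, time, channels, height, width).
    """
    b, t, c, h, w = x.shape
    n = int(c * shift_fraction)           # channels shifted in each direction
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]        # first n channels: shift forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]  # next n channels: shift backward
    out[:, :, 2 * n:] = x[:, :, 2 * n:]   # remaining channels are left untouched
    return out
```

In a TSM ResNet, this shift is typically applied at the input of each residual block, adding temporal modeling to the 2D backbone at zero parameter cost.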
Author countries
Israel