Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection

Authors: Sayantan Das, Mojtaba Kolahdouzi, Levent Özparlak, Will Hickie, Ali Etemad

Published: 2023-06-12 05:49:23+00:00

AI Summary

This paper proposes MASDT, a novel deepfake video detection method using a pair of vision transformers pre-trained with a self-supervised masked autoencoding setup. One transformer learns spatial information from RGB frames, while the other learns temporal consistency from optical flow fields; the final prediction is a score-level fusion of both.
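Score-level fusion of the two streams can be pictured with a short sketch like the one below, which simply combines the per-class probabilities produced by the spatial and temporal classifiers. The fusion weight `alpha` and the real/fake label convention are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of score-level fusion between a spatial and a temporal
# classifier, assuming each stream already outputs per-class logits.
import torch
import torch.nn.functional as F

def fuse_scores(spatial_logits: torch.Tensor,
                temporal_logits: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Combine per-stream class probabilities into one prediction."""
    p_spatial = F.softmax(spatial_logits, dim=-1)    # (batch, 2) from RGB stream
    p_temporal = F.softmax(temporal_logits, dim=-1)  # (batch, 2) from optical-flow stream
    return alpha * p_spatial + (1.0 - alpha) * p_temporal

# Example: equal weighting of the two streams on a dummy batch.
spatial_logits = torch.randn(4, 2)
temporal_logits = torch.randn(4, 2)
fused = fuse_scores(spatial_logits, temporal_logits)
pred = fused.argmax(dim=-1)  # 0 = real, 1 = fake (label convention assumed)
```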

Abstract

We present a novel approach for the detection of deepfake videos using a pair of vision transformers pre-trained by a self-supervised masked autoencoding setup. Our method consists of two distinct components, one of which focuses on learning spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames. Unlike most approaches where pre-training is performed on a generic large corpus of images, we show that by pre-training on smaller face-related datasets, namely Celeb-A (for the spatial learning component) and YouTube Faces (for the temporal learning component), strong results can be obtained. We perform various experiments to evaluate the performance of our method on commonly used datasets, namely FaceForensics++ (Low Quality and High Quality, along with a new highly compressed version named Very Low Quality) and Celeb-DFv2. Our experiments show that our method sets a new state-of-the-art on FaceForensics++ (LQ, HQ, and VLQ), and obtains competitive results on Celeb-DFv2. Moreover, our method outperforms other methods in the area in a cross-dataset setup where we fine-tune our model on FaceForensics++ and test on Celeb-DFv2, pointing to its strong cross-dataset generalization ability.
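To make the masked autoencoding setup concrete, the sketch below shows MAE-style pre-training on patch tokens: a random subset of patches is kept, the encoder sees only the visible patches, and a lightweight decoder reconstructs the masked ones from placeholder tokens. The masking ratio, patch size, and the toy linear encoder/decoder are stand-ins for illustration only, not the paper's configuration (which uses ViT-B encoders).

```python
# A minimal sketch of MAE-style masked autoencoding pre-training (assumed
# setup; the paper's exact architecture and hyperparameters differ).
import torch
import torch.nn as nn

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample (MAE-style masking)."""
    b, n, d = tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)      # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=tokens.device)
    mask[:, :n_keep] = 0                          # 0 = visible, 1 = masked (shuffled order)
    mask = torch.gather(mask, 1, ids_restore)     # back to original patch order
    return kept, mask, ids_restore

# Toy encoder/decoder: the real model uses ViT blocks; Linear layers stand in here.
patch_dim, embed_dim = 16 * 16 * 3, 256
encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                        nn.Linear(embed_dim, embed_dim))
decoder = nn.Linear(embed_dim, patch_dim)

patches = torch.randn(8, 196, patch_dim)          # 8 face crops, 14x14 patches each
kept, mask, ids_restore = random_masking(patches)

latent = encoder(kept)                            # encode visible patches only
b, n, _ = patches.shape
mask_tokens = torch.zeros(b, n - latent.shape[1], embed_dim)
full = torch.cat([latent, mask_tokens], dim=1)    # append placeholders for masked patches
full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, embed_dim))
recon = decoder(full)                             # predict pixel values for every patch

# Reconstruction loss computed on masked patches only, as in MAE.
loss = (((recon - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
loss.backward()
```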


Key findings
MASDT achieves state-of-the-art results on FaceForensics++ (LQ, HQ, and VLQ), competitive results on Celeb-DFv2, and strong cross-dataset generalization when fine-tuned on FaceForensics++ and tested on Celeb-DFv2. Ablation studies highlight the importance of both the spatial and temporal components and the effectiveness of score-level fusion.
Approach
MASDT uses two vision transformers, pre-trained via masked autoencoding: one on Celeb-A to learn spatial features from RGB frames, and another on YouTube Faces to learn temporal consistency from optical flow fields. Both are then fine-tuned for deepfake detection, and their output scores are fused for the final prediction, as sketched below.
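A minimal fine-tuning sketch for one stream, assuming an MAE pre-trained ViT-B encoder is available; torchvision's `vit_b_16` stands in for that encoder here, and the two-class head, optimizer, and single training step are illustrative assumptions rather than the paper's training recipe.

```python
# Fine-tuning sketch: pre-trained ViT-B encoder + binary real/fake head.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

encoder = vit_b_16()                  # in practice, load the MAE pre-trained weights here
encoder.heads = nn.Linear(768, 2)     # replace classification head: real vs. fake
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

frames = torch.randn(4, 3, 224, 224)  # dummy batch of cropped face frames
labels = torch.tensor([0, 1, 1, 0])   # 0 = real, 1 = fake (label convention assumed)

logits = encoder(frames)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

The temporal stream is fine-tuned analogously on optical flow inputs, and the two streams' scores are combined as in the fusion sketch above.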
Datasets
FaceForensics++ (Low Quality, High Quality, Very Low Quality), Celeb-DFv2, Celeb-A, YouTube Faces
Model(s)
Vision Transformer (ViT-B), PWC-Net (for optical flow)
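For the temporal stream, optical flow fields are computed between consecutive frames. The paper uses PWC-Net; since PWC-Net is not bundled with torchvision, the sketch below uses RAFT purely as a readily available stand-in (it downloads pretrained weights on first use) to illustrate the flow-extraction step.

```python
# Optical flow between consecutive frames, using torchvision's RAFT as a
# stand-in for the paper's PWC-Net.
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

weights = Raft_Small_Weights.DEFAULT
model = raft_small(weights=weights).eval()
preprocess = weights.transforms()                # normalizes image pairs for RAFT

# Two consecutive (dummy) frames; height/width must be divisible by 8.
frame_t  = torch.rand(1, 3, 224, 224)
frame_t1 = torch.rand(1, 3, 224, 224)
frame_t, frame_t1 = preprocess(frame_t, frame_t1)

with torch.no_grad():
    flows = model(frame_t, frame_t1)             # list of iterative flow refinements
flow = flows[-1]                                 # final estimate, shape (1, 2, 224, 224)
```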
Author countries
Canada, Netherlands