MARLIN: Masked Autoencoder for facial video Representation LearnINg

Authors: Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, Munawar Hayat

Published: 2022-11-12 10:29:05+00:00

AI Summary

MARLIN is a self-supervised masked autoencoder for learning universal facial representations from videos. It reconstructs spatio-temporal facial details from masked regions, enabling transfer learning across diverse downstream tasks such as facial attribute recognition, facial expression recognition, deepfake detection, and lip synchronization, with performance gains over existing methods.

Abstract

This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder and feature extractor that performs consistently well across a variety of downstream tasks, including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), and LS (29.36% gain for Fréchet Inception Distance), even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN .


Key findings
MARLIN outperforms or is competitive with state-of-the-art methods on multiple downstream tasks, including deepfake detection, facial attribute recognition, and facial expression recognition, even in low-data regimes. The facial-guided masking (Fasking) strategy and adversarial training each yield significant performance improvements. Qualitative results show that MARLIN attends to relevant facial features, supporting accurate reconstruction and detection.
Approach
MARLIN is a facial video masked autoencoder with a facial-region-guided masking strategy (Fasking). It reconstructs spatio-temporal facial details from densely masked regions (eyes, nose, mouth, lips, skin), and uses adversarial training to improve reconstruction quality and learn more robust features.
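To make the masking idea concrete, here is a minimal sketch of what a facial-region-guided masking step could look like. This is an illustrative simplification, not the authors' implementation: `fasking_mask` and `face_prior` are hypothetical names, and the real method operates on temporal tubes of video patches with face parsing to derive the prior.

```python
import numpy as np

def fasking_mask(face_prior, mask_ratio=0.9, rng=None):
    """Sample a patch mask biased toward face-covered patches.

    face_prior : (H_p, W_p) array of per-patch face coverage in [0, 1],
                 shared across frames so masked patches form temporal tubes.
    Returns a boolean (H_p, W_p) array where True means "masked".
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = face_prior.ravel() + 1e-6          # small floor so every patch can be picked
    probs = probs / probs.sum()                # normalize to a sampling distribution
    n_patches = probs.size
    n_mask = int(round(mask_ratio * n_patches))
    # Sample without replacement, preferring patches that cover facial regions.
    masked_idx = rng.choice(n_patches, size=n_mask, replace=False, p=probs)
    mask = np.zeros(n_patches, dtype=bool)
    mask[masked_idx] = True
    return mask.reshape(face_prior.shape)
```

With a uniform prior this reduces to the random tube masking of video MAEs; a face-parsing-derived prior concentrates the reconstruction task on the eyes, nose, mouth, lips, and skin, as described above.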
Datasets
YouTube Faces (YTF), CelebV-HQ, CMU-MOSEI, FaceForensics++ (FF++), LRS2
Model(s)
Vision Transformer (ViT)-based masked autoencoder with adversarial training.
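The training objective can be sketched as a masked-patch reconstruction loss plus an adversarial term. The code below is a hedged illustration under common MAE/GAN conventions, not the paper's exact losses or weights: `adv_weight` and the non-saturating generator loss are assumptions.

```python
import numpy as np

def masked_recon_loss(pred, target, mask):
    """Mean squared error computed only on masked patches (standard MAE practice).

    pred, target : (N_patches, D) arrays of patch pixels.
    mask         : (N_patches,) array, 1 where the patch was masked.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return (per_patch * mask).sum() / max(mask.sum(), 1)

def total_loss(pred, target, mask, d_fake_score, adv_weight=0.1):
    """Reconstruction loss plus a non-saturating generator loss.

    d_fake_score : discriminator outputs in (0, 1] for reconstructed patches.
    adv_weight   : illustrative weighting, not taken from the paper.
    """
    l_rec = masked_recon_loss(pred, target, mask)
    l_adv = -np.log(d_fake_score + 1e-8).mean()  # generator wants scores near 1
    return l_rec + adv_weight * l_adv
```

The adversarial term pushes reconstructions toward realistic facial texture, which the key findings above credit with a significant share of the downstream gains.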
Author countries
Australia, China, India