Generalizable Deepfake Detection with Phase-Based Motion Analysis

Authors: Ekta Prashnani, Michael Goebel, B. S. Manjunath

Published: 2022-11-17 06:28:01+00:00

AI Summary

PhaseForensics, a deepfake video detection method, uses a phase-based motion representation of facial temporal dynamics to improve cross-dataset generalization and robustness to distortions and adversarial attacks. It leverages temporal phase variations in face sub-regions, providing a robust motion estimate less susceptible to cross-dataset variations and adversarial perturbations.

Abstract

We propose PhaseForensics, a DeepFake (DF) video detection method that leverages a phase-based motion representation of facial temporal dynamics. Existing methods relying on temporal inconsistencies for DF detection offer many advantages over typical frame-based methods. However, they still show limited cross-dataset generalization and robustness to common distortions. These shortcomings are partially due to error-prone motion estimation and landmark tracking, or the susceptibility of pixel-intensity-based features to spatial distortions and cross-dataset domain shifts. Our key insight to overcome these issues is to leverage the temporal phase variations in the band-pass components of the Complex Steerable Pyramid on face sub-regions. This not only enables a robust estimate of the temporal dynamics in these regions, but is also less prone to cross-dataset variations. Furthermore, the band-pass filters used to compute the local per-frame phase form an effective defense against the perturbations commonly seen in gradient-based adversarial attacks. Overall, with PhaseForensics, we show improved distortion and adversarial robustness, and state-of-the-art cross-dataset generalization, with 91.2% video-level AUC on the challenging CelebDFv2 dataset (compared with 86.9% for a recent state-of-the-art method).


Key findings

PhaseForensics achieves state-of-the-art cross-dataset generalization, reaching 91.2% video-level AUC on CelebDFv2. It also demonstrates improved robustness to spatial distortions and adversarial attacks compared to existing methods, particularly benefiting from the use of phase-based features instead of pixel intensities.
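
To make the "video-level AUC" metric concrete, here is a minimal sketch of how such a number could be computed. It assumes that per-clip fake scores are averaged into one score per video; the summary does not state how the authors aggregate, so that choice, along with the function names `video_level_scores` and `auc`, is hypothetical.

```python
import numpy as np

def video_level_scores(clip_scores, video_ids):
    """Average clip scores per video (mean aggregation is an assumption)."""
    scores = np.asarray(clip_scores, dtype=float)
    ids = np.asarray(video_ids)
    unique_ids = np.unique(ids)  # sorted unique video identifiers
    means = np.array([scores[ids == v].mean() for v in unique_ids])
    return unique_ids, means

def auc(scores, labels):
    """AUC as the fraction of (fake, real) pairs ranked correctly.

    Ties count as half a correct pair. labels: 1 = fake, 0 = real.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all pairwise score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Illustrative data only, not results from the paper.
ids, vid_scores = video_level_scores(
    [0.9, 0.7, 0.2, 0.4, 0.6, 0.8, 0.5, 0.1],
    ["a", "a", "b", "b", "c", "c", "d", "d"],
)
vid_labels = np.array([1, 0, 1, 0])  # videos a and c are fake, b and d are real
print(ids, vid_scores, auc(vid_scores, vid_labels))
```

The pairwise formulation is quadratic in the number of videos, which is fine for a sketch; a rank-based implementation would scale better on large test sets.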

Approach

PhaseForensics extracts phase variations from band-pass components of a Complex Steerable Pyramid (CSP) applied to face sub-regions (e.g., lips). These spatio-temporally filtered phase features are then fed into a ResNet-18 feature extractor and a multi-scale temporal convolutional network for deepfake classification.
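
The idea of reading motion from temporal phase variations can be illustrated with a much simpler stand-in for the Complex Steerable Pyramid. The sketch below applies a single one-sided Gaussian band-pass filter in the frequency domain (one scale, one orientation), takes the local phase of the complex response per frame, and differences it over time. The filter design, its parameters, and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def oriented_bandpass(shape, center_freq=0.125, bandwidth=0.05, theta=0.0):
    """One-sided Gaussian band-pass in the 2D frequency plane (hypothetical filter).

    Passing only frequencies near one positive center makes the filtered
    signal complex-valued, so it carries a local phase.
    """
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]  # cycles per pixel, vertical
    fx = np.fft.fftfreq(w)[None, :]  # cycles per pixel, horizontal
    cx = center_freq * np.cos(theta)
    cy = center_freq * np.sin(theta)
    return np.exp(-((fx - cx) ** 2 + (fy - cy) ** 2) / (2 * bandwidth ** 2))

def local_phase(frames, **filter_kwargs):
    """Return per-pixel phase and amplitude for frames of shape (T, H, W)."""
    filt = oriented_bandpass(frames.shape[1:], **filter_kwargs)
    spectrum = np.fft.fft2(frames, axes=(-2, -1))
    response = np.fft.ifft2(spectrum * filt, axes=(-2, -1))
    return np.angle(response), np.abs(response)

def temporal_phase_delta(phase):
    """Frame-to-frame phase change, wrapped back into (-pi, pi]."""
    return np.angle(np.exp(1j * np.diff(phase, axis=0)))

# Synthetic clip: a horizontal sinusoid moving right by 1 pixel per frame.
x = np.arange(64)
clip = np.stack([np.tile(np.cos(2 * np.pi * 0.125 * (x - t)), (32, 1))
                 for t in range(4)])
phase, amplitude = local_phase(clip)
delta = temporal_phase_delta(phase)
# For a pattern at spatial frequency f, a shift of s pixels changes the
# phase by -2*pi*f*s, so the shift can be recovered from delta.
shift = -delta / (2 * np.pi * 0.125)
print(delta.mean(), shift.mean())
```

In the method described above, the phase features would come from several pyramid bands computed on a face sub-region such as the lips, and would then be passed to the ResNet-18 and temporal network; this sketch covers only the phase-extraction step.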

Datasets

FaceForensics++, CelebDFv2, DFDC, VideoForensicsHQ, DeeperForensics, FaceShifter, Lip Reading in the Wild (LRW), EVE

Model(s)

ResNet-18, Multi-scale Temporal Convolutional Network (MSTCN), Complex Steerable Pyramid (CSP)

Author countries

USA