Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion

Authors: Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach

Published: 2021-12-21 01:57:04+00:00

AI Summary

This research introduces a novel multi-modal approach for video falsification detection that analyzes word-conditioned facial movements. It leverages Action Units (AUs) to capture person-specific biometric patterns in facial expressions and head movements associated with specific words, addressing both deepfakes and cheapfakes.

Abstract

In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach to discover clues that go beyond detecting discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. In this work, our goal is to verify that the purported person seen in the video is indeed themselves by detecting anomalous facial movements corresponding to the spoken words. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs) to capture a person's face and head movement as opposed to deep CNN features, and we are the first to use word-conditioned facial motion analysis. We further demonstrate our method's effectiveness on a range of fakes not seen in training, including those without video manipulation, which were not addressed in prior work.


Key findings
The proposed method outperforms state-of-the-art deepfake detection methods on a range of fake videos, including those with no video manipulation. Word-conditioned analysis proves superior to phoneme-based analysis, demonstrating the importance of semantic information. The approach also offers interpretability by revealing person-specific word-gesture patterns indicative of authenticity.
Approach
Audio transcription is used to align spoken words with video frames. For each word, Action Unit (AU) features describing the speaker's facial and head movement are extracted, and person-specific linear classifiers are trained, one per word, to distinguish the speaker's real motion from fake motion. The final video classification aggregates the word-level scores, as sketched below.
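The following Python sketch shows one plausible way to organize this per-word training pipeline. It assumes OpenFace-style AU features provided as a (num_frames, num_AUs) array, word-to-frame alignments as (word, start_frame, end_frame) tuples, mean pooling over each word's frame span, and scikit-learn logistic regression; the function names and the pooling choice are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of per-word, person-specific classifier training.
# Assumptions: AU features from an OpenFace-style extractor, shape (num_frames, num_AUs);
# a forced aligner supplies (word, start_frame, end_frame) tuples; label 1 = real, 0 = fake.
import numpy as np
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

def word_features(au_frames, word_spans):
    """Pool AU features over each word's frame span (mean pooling assumed here)."""
    feats = defaultdict(list)
    for word, start, end in word_spans:
        feats[word].append(au_frames[start:end].mean(axis=0))
    return feats

def train_per_word_classifiers(real_videos, fake_videos):
    """Train one linear classifier per word to separate real from fake motion."""
    pooled = defaultdict(lambda: ([], []))  # word -> (feature vectors, labels)
    for label, videos in ((1, real_videos), (0, fake_videos)):
        for au_frames, word_spans in videos:
            for word, vecs in word_features(au_frames, word_spans).items():
                pooled[word][0].extend(vecs)
                pooled[word][1].extend([label] * len(vecs))
    classifiers = {}
    for word, (X, y) in pooled.items():
        if len(set(y)) == 2:  # a classifier needs both real and fake examples of the word
            classifiers[word] = LogisticRegression(max_iter=1000).fit(np.array(X), y)
    return classifiers
```

Mean pooling is the simplest choice; richer per-word statistics of the AU trajectories could be substituted without changing the overall structure.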
Datasets
A dataset comprising real and fake videos of US politicians (Obama, Trump, Biden, Harris) and TV talk show hosts (Oliver, O'Brien). Fake videos include those created via audio dubbing, Wav2Lip lip-sync, impersonators, FaceSwap, and in-the-wild lip-sync deepfakes.
Model(s)
A person-specific linear logistic regression classifier is trained for each word, using Action Unit (AU) features as input. The geometric mean of the word-level scores gives the final video-level classification, as in the aggregation sketch below.
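Below is a hedged sketch of this aggregation step. It assumes the per-word classifiers from the training sketch output a probability that the motion is real, that words without a trained classifier are skipped, and that a neutral fallback score of 0.5 is returned when no scored words remain; these details are illustrative, not confirmed by the paper.

```python
# Hedged sketch of video-level scoring via a geometric mean of word-level scores.
import numpy as np

def score_video(au_frames, word_spans, classifiers):
    """Return a video-level authenticity score in (0, 1]."""
    word_scores = []
    for word, start, end in word_spans:
        clf = classifiers.get(word)
        if clf is None:
            continue  # assumed behavior: skip words with no trained classifier
        feat = au_frames[start:end].mean(axis=0, keepdims=True)
        word_scores.append(clf.predict_proba(feat)[0, 1])  # P(real | word motion)
    if not word_scores:
        return 0.5  # assumed neutral fallback when no words could be scored
    # Geometric mean, computed in log space for numerical stability.
    return float(np.exp(np.mean(np.log(np.clip(word_scores, 1e-8, 1.0)))))
```

The geometric mean penalizes videos in which even a few words exhibit motion that the person-specific classifiers consider highly unlikely, which matches the intuition of flagging anomalous word-conditioned gestures.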
Author countries
USA