Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

Authors: Tharun Anand, Siva Sankar Sajeev, Pravin Nair

Published: 2025-03-28 03:49:00+00:00

AI Summary

This paper introduces the first deepfake detection approach explicitly designed for localized facial manipulations. It leverages spatiotemporal representations guided by facial action units (AUs), fusing them via cross-attention to effectively encode subtle changes, achieving a 20% accuracy improvement over state-of-the-art methods.

Abstract

With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods involves localized edits: subtle manipulations of specific facial features, such as raising the eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method uses a cross-attention-based fusion of representations learned from pretext tasks like random masking and action unit detection to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF++ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a 20% improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.


Key findings
The proposed method significantly outperforms state-of-the-art deepfake detection methods on videos with localized edits, achieving a 20% accuracy improvement. It also demonstrates competitive performance on standard datasets, highlighting its robustness and generalization ability across diverse manipulation types. The model shows resilience to common video perturbations.
Approach
The approach uses a cross-attention-based fusion of representations learned from pretext tasks: random masking and AU detection. This creates an embedding that captures subtle localized changes in deepfake videos. The model is trained on the FF++ dataset but generalizes well to newer deepfake methods.
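The paper does not include code here, so the following PyTorch sketch only illustrates what a cross-attention fusion of the two pretext-task feature streams could look like. The module name AUGuidedFusion, the embedding size d_model, and the choice of using the masked-reconstruction tokens as queries against the AU-detection tokens are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of cross-attention fusion of two feature streams.
# All names and the query/key-value assignment are illustrative assumptions.
import torch
import torch.nn as nn


class AUGuidedFusion(nn.Module):
    """Fuse masked-reconstruction features with AU-detection features
    via cross-attention to produce a single video-level embedding."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Queries come from the masked-reconstruction stream,
        # keys/values from the AU-detection stream (an assumption).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mask_tokens: torch.Tensor, au_tokens: torch.Tensor) -> torch.Tensor:
        # mask_tokens, au_tokens: (batch, num_tokens, d_model)
        fused, _ = self.cross_attn(query=mask_tokens, key=au_tokens, value=au_tokens)
        fused = self.norm(mask_tokens + fused)  # residual connection
        return fused.mean(dim=1)                # pooled video embedding
```

The cross-attention lets AU-guided tokens reweight the reconstruction features at token granularity, which is the mechanism the paper credits for capturing subtle, localized edits.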
Datasets
FaceForensics++ (FF++), CelebV-HQ
Model(s)
A Video Masked Autoencoder (VideoMAE)-based encoder-decoder architecture with a cross-attention mechanism that fuses features from the masked frame reconstruction and AU detection pretext tasks.
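Continuing the hedged sketch above, the wiring below shows one plausible way two VideoMAE-style encoders (one per pretext task) could feed the fusion module and a binary real/fake head. DeepfakeDetector, the encoder interfaces, and the two-logit classifier are hypothetical choices, not details confirmed by the paper.

```python
# Illustrative pipeline wiring under assumed encoder interfaces:
# each encoder maps a video clip to (batch, num_tokens, d_model) token features.
import torch
import torch.nn as nn


class DeepfakeDetector(nn.Module):
    def __init__(self, mask_encoder: nn.Module, au_encoder: nn.Module,
                 fusion: nn.Module, d_model: int = 768):
        super().__init__()
        self.mask_encoder = mask_encoder      # pretrained via masked frame reconstruction
        self.au_encoder = au_encoder          # pretrained via action unit detection
        self.fusion = fusion                  # e.g., the AUGuidedFusion sketch above
        self.classifier = nn.Linear(d_model, 2)  # real vs. fake logits

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width) clip tensor
        mask_tokens = self.mask_encoder(video)  # (batch, tokens, d_model)
        au_tokens = self.au_encoder(video)      # (batch, tokens, d_model)
        embedding = self.fusion(mask_tokens, au_tokens)
        return self.classifier(embedding)       # (batch, 2) classification logits
```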
Author countries
India