Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Authors: Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela

Published: 2025-12-03 19:00:07+00:00

AI Summary

This paper introduces the Kinematic Model for facial motion Inconsistencies (KiMoI), a synthetic video generation method that creates training data with subtle kinematic inconsistencies for generalizable deepfake video detection. A Landmark Perturbation Network (LPN) decomposes facial landmark configurations into motion bases, which are then manipulated to break the natural correlations between facial movements. The resulting biomechanical flaws are introduced into pristine videos via face morphing, and a network trained on this data achieves state-of-the-art generalization across popular benchmarks.

Abstract

Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.


Key findings
The proposed KiMoI method achieves state-of-the-art generalization in deepfake video detection, setting new records on the DFDCP and DF40 benchmarks and yielding the highest average AUC across multiple datasets. Combining KiMoI's data-driven temporal pseudo-fakes with existing spatial artifacts significantly boosts performance, improving the average AUC by more than 4 points relative to a baseline model. The kinematic inconsistencies learned by the LPN are semantically richer than those produced by simpler analytical noise models and prove crucial for reliable deepfake detection.
Approach
The method, KiMoI, generates pseudo-fake videos by introducing subtle kinematic inconsistencies into pristine footage. It uses a Landmark Perturbation Network (LPN), an autoencoder, to decompose facial landmark sequences into motion bases. These bases are then manipulated by adding Gaussian noise to selectively break natural correlations in facial movements. Finally, these altered landmark sequences guide a face morphing pipeline to distort facial regions in original frames, creating pseudo-fake videos with temporal artifacts for training deepfake detectors.
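A minimal PyTorch sketch of this pipeline follows. All names, layer sizes, the number of motion bases, and the noise parameters are illustrative assumptions; the paper's actual LPN architecture and morphing implementation are not reproduced here.

```python
import torch
import torch.nn as nn

class LandmarkPerturbationNet(nn.Module):
    """Toy stand-in for the paper's LPN: a transformer autoencoder over
    landmark sequences. Layer sizes, the token mapping, and the number of
    motion bases are illustrative guesses, not the authors' configuration."""

    def __init__(self, num_landmarks=68, num_bases=16, d_model=128):
        super().__init__()
        self.num_landmarks = num_landmarks
        self.proj_in = nn.Linear(num_landmarks * 2, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.to_bases = nn.Linear(d_model, num_bases)   # per-frame motion coefficients
        self.from_bases = nn.Linear(num_bases, d_model)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.proj_out = nn.Linear(d_model, num_landmarks * 2)

    def encode(self, lm_seq):
        # lm_seq: (B, T, L, 2) landmark tracks -> (B, T, num_bases)
        b, t, l, _ = lm_seq.shape
        h = self.encoder(self.proj_in(lm_seq.reshape(b, t, l * 2)))
        return self.to_bases(h)

    def decode(self, bases):
        # bases: (B, T, num_bases) -> reconstructed (B, T, L, 2) landmarks
        h = self.decoder(self.from_bases(bases))
        out = self.proj_out(h)
        return out.reshape(out.shape[0], out.shape[1], self.num_landmarks, 2)


def perturb_bases(bases, frac=0.25, sigma=0.1):
    """Add Gaussian noise to a random subset of motion bases, breaking the
    correlations between facial movements; frac/sigma values are made up."""
    noisy = bases.clone()
    k = max(1, int(frac * bases.shape[-1]))
    idx = torch.randperm(bases.shape[-1])[:k]
    noisy[..., idx] += sigma * torch.randn_like(noisy[..., idx])
    return noisy


# Usage: the perturbed landmarks would then drive a face-morphing warp
# (e.g., a piecewise-affine warp) applied to the pristine frames.
lpn = LandmarkPerturbationNet().eval()
lm_seq = torch.randn(1, 32, 68, 2)          # 32-frame landmark track (dummy data)
with torch.no_grad():
    lm_fake = lpn.decode(perturb_bases(lpn.encode(lm_seq)))
```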
Datasets
FaceForensics++ (FF++, c23 subset) for detector training; CelebV-HQ for LPN training; and Celeb-DFv2 (CDF), DFD, DFDCP, WildDeepFake (WDF), DeeperForensics-1.0 (DFo), and DF40 (BlendFace, FSGAN, MobileSwap subsets) for cross-dataset evaluation.
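The evaluation protocol is cross-dataset: train on FF++ and report AUC on each held-out benchmark. A minimal sketch of the metric aggregation, assuming per-dataset label and score arrays are already available (how per-video scores are pooled is not specified here):

```python
from sklearn.metrics import roc_auc_score

def cross_dataset_auc(results):
    """results: dict mapping dataset name -> (labels, scores) arrays.
    Returns per-dataset AUC plus the cross-dataset average."""
    aucs = {name: roc_auc_score(y, s) for name, (y, s) in results.items()}
    aucs["avg"] = sum(aucs.values()) / len(aucs)   # computed before "avg" is added
    return aucs
```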
Model(s)
MARLIN encoder (ViT-B and ViT-L configurations) for deepfake detection. Landmark annotations are extracted using RetinaFace and SPIGA. The Landmark Perturbation Network (LPN) is implemented as a transformer network.
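A minimal sketch of how such a detector could be wired, with a dummy module standing in for the pretrained MARLIN backbone; the encoder's interface and the linear head are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DummyVideoEncoder(nn.Module):
    """Placeholder for a pretrained MARLIN ViT backbone; it only mimics the
    assumed interface: clip (B, C, T, H, W) -> features (B, embed_dim)."""
    def __init__(self, embed_dim=768):               # 768 matches ViT-B
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.proj = nn.Linear(3, embed_dim)

    def forward(self, clip):
        return self.proj(self.pool(clip).flatten(1))

class DeepfakeDetector(nn.Module):
    """Video encoder followed by a linear real/fake head, trained with
    binary cross-entropy on pristine vs. pseudo-fake clips."""
    def __init__(self, encoder, embed_dim=768):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, clip):
        return self.head(self.encoder(clip)).squeeze(-1)

# One hypothetical training step on a batch of clips and 0/1 labels.
model = DeepfakeDetector(DummyVideoEncoder())
clips = torch.randn(2, 3, 16, 224, 224)     # (B, C, T, H, W) dummy batch
labels = torch.tensor([0.0, 1.0])           # 0 = real, 1 = pseudo-fake
loss = nn.BCEWithLogitsLoss()(model(clips), labels)
loss.backward()
```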
Author countries
Spain