Compressed Deepfake Video Detection Based on 3D Spatiotemporal Trajectories

Authors: Zongmei Chen, Xin Liao, Xiaoshuai Wu, Yanxiang Chen

Published: 2024-04-28 11:48:13+00:00

AI Summary

This paper introduces a deepfake video detection method that remains robust under video compression, unlike existing methods whose performance degrades on compressed videos. It leverages 3D spatiotemporal trajectories of facial landmarks, decoupling head movements from facial expressions for improved accuracy and robustness.

Abstract

The misuse of deepfake technology by malicious actors poses a potential threat to nations, societies, and individuals. However, existing methods for detecting deepfakes primarily focus on uncompressed videos, relying on cues such as noise characteristics, local textures, or frequency statistics. When applied to compressed videos, these methods experience a decrease in detection performance and are less suitable for real-world scenarios. In this paper, we propose a deepfake video detection method based on 3D spatiotemporal trajectories. Specifically, we utilize a robust 3D model to construct spatiotemporal motion features, integrating feature details from both 2D and 3D frames to mitigate the influence of large head rotation angles or insufficient lighting within frames. Furthermore, we separate facial expressions from head movements and design a sequential analysis method based on phase space motion trajectories to explore the feature differences between genuine and fake faces in deepfake videos. We conduct extensive experiments to validate the performance of our proposed method on several compressed deepfake benchmarks. The robustness of the well-designed features is verified by calculating the consistent distribution of facial landmarks before and after video compression. Our method yields satisfactory results and showcases its potential for practical applications.
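
The phase space analysis mentioned above can be illustrated with a time-delay embedding, which lifts a scalar motion signal into a multi-dimensional trajectory. The following is a minimal Python sketch; the embedding dimension, delay, and the synthetic stand-in signal are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def delay_embed(signal: np.ndarray, dim: int = 3, tau: int = 2) -> np.ndarray:
    """Time-delay (phase space) embedding of a 1-D motion signal.

    Maps a scalar trajectory x(t) to points
    [x(t), x(t + tau), ..., x(t + (dim - 1) * tau)] in phase space.
    `dim` and `tau` are illustrative choices, not values from the paper.
    """
    n = len(signal) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("signal too short for the chosen dim/tau")
    return np.stack([signal[i * tau : i * tau + n] for i in range(dim)], axis=1)

# Example: embed the per-frame vertical displacement of one facial landmark.
# `landmark_y` would come from the 3D landmark tracker (hypothetical input).
landmark_y = np.sin(np.linspace(0, 8 * np.pi, 200))  # stand-in motion signal
trajectory = delay_embed(landmark_y, dim=3, tau=2)   # shape (196, 3)
print(trajectory.shape)
```

Genuine and fake faces are then compared through the geometry of such trajectories over a video's frame sequence.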


Key findings

The proposed method achieves state-of-the-art performance on compressed deepfake video detection benchmarks (FF++, DFDC, Celeb-DF). Its robustness to compression is attributed to the use of a 3D model for landmark tracking and the decoupling of head movements from facial expressions. The method also demonstrates high detection efficiency.
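
As a rough illustration of how landmark stability under compression could be checked, here is a hedged sketch that compares tracked 3D landmarks from the same video before and after compression. The function name, array shapes, and mean-displacement metric are assumptions for illustration, not the paper's exact distribution comparison.

```python
import numpy as np

def landmark_consistency(raw: np.ndarray, compressed: np.ndarray) -> float:
    """Mean per-landmark displacement between raw and compressed frames.

    `raw` and `compressed` are (frames, landmarks, 3) arrays of tracked
    3D landmarks from the same video before and after compression. A
    small value indicates the landmark features survive compression.
    """
    return float(np.linalg.norm(raw - compressed, axis=-1).mean())

# Hypothetical usage with tracked landmark arrays:
raw = np.random.rand(100, 68, 3)                         # stand-in tracks
compressed = raw + 0.001 * np.random.randn(100, 68, 3)   # mild perturbation
print(landmark_consistency(raw, compressed))
```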
Approach

A robust 3D model provides facial landmark localization and tracking, decoupling head movements from facial expressions. Spatiotemporal features are then constructed and analyzed via phase space motion trajectories, with a lightweight Transformer network performing classification, as sketched below.
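
The classification stage can be sketched as a small Transformer encoder over per-frame motion features. The following PyTorch snippet is a minimal sketch under assumed feature dimensions and hyperparameters; the paper specifies a lightweight Transformer, but the sizes below are illustrative, not its reported configuration.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """Lightweight Transformer over per-frame motion features.

    Input: (batch, frames, feat_dim) sequences formed by concatenating
    decoupled head-pose and expression features per frame. All sizes
    here are illustrative assumptions, not the paper's configuration.
    """
    def __init__(self, feat_dim: int = 64, d_model: int = 128,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)  # real vs. fake

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.proj(x))   # (batch, frames, d_model)
        return self.head(h.mean(dim=1))  # pool over time, then classify

# Example: 8 clips, 32 frames each, 64-dim motion features per frame.
logits = TrajectoryClassifier()(torch.randn(8, 32, 64))
print(logits.shape)  # torch.Size([8, 2])
```

Mean pooling over the frame axis is one simple way to aggregate the sequence before the classification head; the paper does not detail its pooling choice here.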
Datasets

FaceForensics++ (FF++), DFDC, Celeb-DF
Model(s)

Lightweight Transformer network, 3D Morphable Face Model (3DMM), gradient tree boosting algorithm
Author countries

China