Spatio-temporal Features for Generalized Detection of Deepfake Videos

Authors: Ipek Ganiyusufoglu, L. Minh Ngô, Nedko Savov, Sezer Karaoglu, Theo Gevers

Published: 2020-10-22 16:28:50+00:00

AI Summary

This paper proposes using spatio-temporal features, modeled by 3D CNNs, for deepfake video detection in order to improve generalization to new manipulation techniques. It demonstrates that spatio-temporal features capture attributes shared across deepfake methods, whereas spatial features learn method-specific attributes, and that this leads to superior generalization performance.

Abstract

For deepfake detection, video-level detectors have not been explored as extensively as image-level detectors, which do not exploit temporal data. In this paper, we empirically show that existing image and sequence classifiers generalize poorly to new manipulation techniques. To this end, we propose spatio-temporal features, modeled by 3D CNNs, to extend generalization capabilities to new kinds of deepfake videos. We show that spatial features learn distinct, deepfake-method-specific attributes, while spatio-temporal features capture attributes shared between deepfake methods. Using the DFDC dataset (arXiv:2006.07397), we provide an in-depth analysis of how sequential and spatio-temporal video encoders utilize temporal information, revealing that our approach captures local spatio-temporal relations and inconsistencies in deepfake videos to which existing sequence encoders are indifferent. Through large-scale experiments on the FaceForensics++ (arXiv:1901.08971) and Deeper Forensics (arXiv:2001.03024) datasets, we show that our approach outperforms existing methods in terms of generalization.


Key findings
- Spatio-temporal models significantly outperform image-based and sequential models in generalizing to unseen deepfake methods.
- 3D CNNs effectively capture attributes shared between different deepfake techniques, while image-based methods learn method-specific features.
- Sequential encoders are insensitive to local spatio-temporal inconsistencies, unlike 3D CNNs.
Approach
The authors propose using 3D Convolutional Neural Networks (CNNs) to extract spatio-temporal features from deepfake videos. This approach captures local spatio-temporal inconsistencies and shared attributes between different deepfake methods, improving generalization compared to methods relying solely on spatial or sequential features.
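As an illustration, the following is a minimal PyTorch sketch (not the authors' released code) of a clip-level 3D-CNN detector built on the R3D-18 backbone listed under Model(s) below. The clip length, input resolution, and binary real/fake head are illustrative assumptions.

```python
# Minimal sketch of the paper's 3D-CNN idea: classify a short
# face-crop clip as real or fake with an R3D-18 backbone.
# Clip shape and the binary head are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class DeepfakeClip3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # R3D-18 convolves jointly over space and time, so local
        # spatio-temporal inconsistencies can be picked up directly.
        self.backbone = r3d_18()
        # Replace the Kinetics-400 head with a binary real/fake head.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels, frames, height, width)
        return self.backbone(clips)

model = DeepfakeClip3DCNN()
dummy = torch.randn(2, 3, 16, 112, 112)  # two 16-frame face-crop clips
logits = model(dummy)                     # (2, 2) real/fake logits
```

Pretrained video weights can be loaded through torchvision's usual weights mechanism before swapping the head; the key design point is that the convolutions span the temporal axis, rather than treating frames independently.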
Datasets
FaceForensics++, Deepfake Detection Challenge (DFDC), Deeper Forensics
Model(s)
R3D-18, I3D-RGB, XceptionNet, EfficientNet-B3, LSTM, bi-directional GRU
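For contrast with the 3D-CNN approach, below is a hedged sketch of the frame-plus-sequence baseline family (a 2D CNN encoding each frame independently, followed by a bi-directional GRU), which the paper finds indifferent to local spatio-temporal inconsistencies. ResNet-18 stands in for XceptionNet purely so the example is self-contained; the hidden size and readout are assumptions.

```python
# Hedged sketch of a sequential baseline: per-frame 2D CNN features
# aggregated by a bi-directional GRU. ResNet-18 is a stand-in backbone;
# the exact architecture in the paper differs.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameGRUBaseline(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        cnn = resnet18()
        # Drop the ImageNet classifier; keep the 512-d pooled features.
        self.frame_encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.gru = nn.GRU(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, frames, channels, height, width)
        b, t, c, h, w = clips.shape
        # Each frame is encoded in isolation, so temporal modeling
        # happens only in the GRU, over already-pooled features.
        feats = self.frame_encoder(clips.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1)   # (batch, frames, 512)
        seq, _ = self.gru(feats)          # sequence-level aggregation
        return self.head(seq[:, -1])      # (batch, 2) real/fake logits

logits = FrameGRUBaseline()(torch.randn(2, 16, 3, 112, 112))
```

Because the frame encoder pools away spatial detail before the GRU sees any temporal context, fine-grained spatio-temporal artifacts are hard for this family to exploit, which is consistent with the paper's finding that such encoders generalize worse than 3D CNNs.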
Author countries
The Netherlands