Exploring Spatial-Temporal Features for Deepfake Detection and Localization

Authors: Wu Haiwei, Zhou Jiantao, Zhang Shile, Tian Jinyu

Published: 2022-10-28 03:38:49+00:00

AI Summary

This paper presents ST-DDL, a network for Deepfake detection and localization that leverages both spatial and temporal features. It introduces a novel Anchor-Mesh Motion (AMM) algorithm for extracting fine-grained motion features and a Fusion Attention (FA) module to integrate spatial and temporal information.

Abstract

With continued research on Deepfake forensics, recent studies have attempted to provide fine-grained localization of forgeries in addition to coarse video-level classification. However, the detection and localization performance of existing Deepfake forensic methods still leaves considerable room for improvement. In this work, we propose a Spatial-Temporal Deepfake Detection and Localization (ST-DDL) network that simultaneously explores spatial and temporal features for detecting and localizing forged regions. Specifically, we design a new Anchor-Mesh Motion (AMM) algorithm to extract temporal (motion) features by modeling the precise geometric movements of facial micro-expressions. Compared with traditional motion extraction methods (e.g., optical flow) designed for objects with large displacements, the proposed AMM can better capture small-displacement facial features. The temporal and spatial features are then fused in a Fusion Attention (FA) module based on a Transformer architecture for the eventual Deepfake forensic tasks. The superiority of our ST-DDL network is verified by experimental comparisons with several state-of-the-art competitors, in terms of both video-level detection and pixel-level localization performance. Furthermore, to spur the future development of Deepfake forensics, we build a public forgery dataset consisting of 6000 videos, with new features such as production with widely used commercial software (e.g., After Effects), online-social-network-transmitted versions, and splicing of multi-source videos. The source code and dataset are available at https://github.com/HighwayWu/ST-DDL.
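The abstract's contrast between optical flow (built for large displacements) and AMM (built for subtle facial motion) can be made concrete with a toy motion extractor. The block-matching sketch below is a hypothetical stand-in, not the paper's AMM: it places a fixed mesh of anchor points on a grayscale face crop and exhaustively searches a small window for each anchor's displacement, which is the small-displacement regime AMM targets.

```python
import numpy as np

def mesh_motion(prev_frame, next_frame, grid_step=8, patch=5, search=2):
    """Per-anchor small-displacement motion via exhaustive block matching.

    Hypothetical stand-in for fine-grained motion extraction; NOT the
    paper's AMM algorithm. A fixed mesh of anchors is matched between two
    grayscale face crops within a tiny search window.
    """
    prev = prev_frame.astype(np.float32)
    nxt = next_frame.astype(np.float32)
    h, w = prev.shape
    r = patch // 2
    ys = np.arange(r + search, h - r - search, grid_step)
    xs = np.arange(r + search, w - r - search, grid_step)
    flow = np.zeros((len(ys), len(xs), 2), dtype=np.float32)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            ref = prev[y - r:y + r + 1, x - r:x + r + 1]
            best, best_d = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    cand = nxt[y + dy - r:y + dy + r + 1,
                               x + dx - r:x + dx + r + 1]
                    cost = np.abs(ref - cand).sum()  # SAD matching cost
                    if cost < best:
                        best, best_d = cost, (dy, dx)
            flow[i, j] = best_d
    return flow  # (grid_h, grid_w, 2) array of per-anchor (dy, dx) shifts
```

The small search radius reflects the micro-expression setting the abstract emphasizes, where frame-to-frame facial displacements are only a few pixels.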


Key findings
ST-DDL outperforms state-of-the-art methods on video-level detection and pixel-level localization across multiple datasets. The new ManualFake dataset, which includes videos produced with commercial software and transmitted through social media, exhibits higher video quality than existing datasets. The method shows some robustness degradation when videos are transmitted through online social networks (OSNs).
Approach
ST-DDL uses two HRNet encoders: one for spatial (RGB) features and one for temporal (motion) features extracted by the proposed AMM algorithm. The two feature streams are fused by a Transformer-based Fusion Attention (FA) module, followed by a decoder for pixel-level localization and an MLP head for video-level classification (see the sketch below).
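Since the approach names the concrete components, the data flow can be sketched end to end. The PyTorch skeleton below is a minimal sketch under stated assumptions: HRNet, AMM, and FA are replaced with generic stand-in layers, and the two-channel motion input matches the toy extractor above. Only the two-stream, fusion, two-head structure follows the paper's description.

```python
import torch
import torch.nn as nn

class STDDLSketch(nn.Module):
    """Minimal sketch of the ST-DDL data flow described above.

    HRNet encoders, AMM, and the FA module are stubbed with generic
    layers (an assumption); only the two-stream -> fusion -> two-head
    structure follows the paper's description.
    """
    def __init__(self, dim=64):
        super().__init__()
        # Stand-ins for the two HRNet encoders (spatial RGB / temporal motion).
        self.spatial_enc = nn.Sequential(nn.Conv2d(3, dim, 3, 2, 1), nn.ReLU())
        self.temporal_enc = nn.Sequential(nn.Conv2d(2, dim, 3, 2, 1), nn.ReLU())
        # Placeholder for the FA module (a Transformer sketch is given later).
        self.fuse = nn.Conv2d(2 * dim, dim, 1)
        # Decoder: per-pixel forgery mask (localization).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1))
        # MLP: video-level real-vs-fake logit (classification).
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, rgb, motion):
        fs = self.spatial_enc(rgb)            # spatial (RGB) features
        ft = self.temporal_enc(motion)        # temporal (motion) features
        fused = self.fuse(torch.cat([fs, ft], dim=1))
        mask = self.decoder(fused)            # pixel-level localization map
        logit = self.mlp(fused.mean(dim=(2, 3)))  # video-level score
        return mask, logit
```

A dummy forward pass, e.g. `STDDLSketch()(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))`, returns a `(1, 1, 64, 64)` localization mask and a `(1, 1)` classification logit.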
Datasets
FF++, DFD, FFIW, and a newly created public dataset, ManualFake (6000 videos).
Model(s)
The ST-DDL network, which uses HRNet encoders, the custom AMM algorithm for motion extraction, and a Transformer-based Fusion Attention (FA) module (sketched below).
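The paper describes FA only as Transformer-based; the module below is a hypothetical cross-attention realization for illustration, with spatial tokens attending to temporal tokens. `FusionAttentionSketch` and its layout are assumptions, not the published design.

```python
import torch
import torch.nn as nn

class FusionAttentionSketch(nn.Module):
    """Hypothetical Transformer-style fusion of spatial and temporal features.

    One plausible layout for a Transformer-based fusion module: spatial
    tokens act as queries over motion tokens, followed by a feed-forward
    block with residual connections. An assumption, not the paper's FA.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, f_spatial, f_temporal):
        # Inputs: (B, C, H, W) feature maps; flatten to token sequences.
        b, c, h, w = f_spatial.shape
        q = f_spatial.flatten(2).transpose(1, 2)    # (B, HW, C) queries
        kv = f_temporal.flatten(2).transpose(1, 2)  # (B, HW, C) keys/values
        x, _ = self.attn(self.norm1(q), kv, kv)     # cross-attention
        x = q + x                                   # residual connection
        x = x + self.ffn(self.norm2(x))             # feed-forward + residual
        return x.transpose(1, 2).reshape(b, c, h, w)
```

Using spatial features as queries and motion features as keys/values is one plausible arrangement; the published FA module may combine the two streams differently.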
Author countries
Macau, China