Delving into Sequential Patches for Deepfake Detection

Authors: Jiazhi Guan, Hang Zhou, Zhibin Hong, Errui Ding, Jingdong Wang, Chengbin Quan, Youjian Zhao

Published: 2022-07-06 16:46:30+00:00

AI Summary

The paper introduces the Local- & Temporal-aware Transformer-based Deepfake Detection (LTTD) framework, which leverages a local-to-global learning protocol focusing on temporal information within local sequences. It uses a Local Sequence Transformer (LST) to model temporal consistency in restricted spatial regions, enhancing low-level information with 3D filters, and achieves final classification through global contrastive learning.
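To make the local-to-global protocol concrete, here is a minimal sketch of how a video clip could be split into sequences of spatially restricted patches before any temporal modeling. The tensor layout, the 16-pixel patch size, and the function name are illustrative assumptions, not values taken from the paper.

```python
import torch

def extract_patch_sequences(clip: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a clip of shape (T, C, H, W) into N local patch sequences of shape
    (N, T, C, patch_size, patch_size), where N = (H // patch_size) * (W // patch_size).
    H and W are assumed divisible by patch_size for simplicity (an illustrative choice)."""
    T, C, H, W = clip.shape
    # Unfold height (dim 2) and width (dim 3) into non-overlapping patches:
    # result has shape (T, C, H/p, W/p, p, p).
    patches = clip.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # Group by spatial location so each row is one restricted region tracked over time.
    patches = patches.permute(2, 3, 0, 1, 4, 5)          # (H/p, W/p, T, C, p, p)
    return patches.reshape(-1, T, C, patch_size, patch_size)

# Example: an 8-frame RGB clip at 224x224 yields 196 local sequences of 16x16 patches.
clip = torch.randn(8, 3, 224, 224)
print(extract_patch_sequences(clip).shape)  # torch.Size([196, 8, 3, 16, 16])
```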

Abstract

Recent advances in face forgery techniques produce nearly visually untraceable deepfake videos, which could be leveraged with malicious intent. As a result, researchers have been devoted to deepfake detection. Previous studies have identified the importance of local low-level cues and temporal information in the pursuit of generalizing well across deepfake methods; however, they still suffer from robustness problems against post-processing. In this work, we propose the Local- & Temporal-aware Transformer-based Deepfake Detection (LTTD) framework, which adopts a local-to-global learning protocol with a particular focus on the valuable temporal information within local sequences. Specifically, we propose a Local Sequence Transformer (LST), which models temporal consistency over sequences of restricted spatial regions, where low-level information is hierarchically enhanced with shallow layers of learned 3D filters. Based on the local temporal embeddings, we then achieve the final classification in a global contrastive way. Extensive experiments on popular datasets validate that our approach effectively spots local forgery cues and achieves state-of-the-art performance.


Key findings
The LTTD framework effectively identifies local forgery cues and achieves state-of-the-art performance in deepfake detection. The approach demonstrates improved generalizability and robustness against post-processing techniques compared to previous methods.
Approach
LTTD divides video clips into sequences of local patches and uses a Local Sequence Transformer (LST) to model temporal consistency within each patch sequence, with low-level information hierarchically enhanced by shallow layers of learned 3D filters. A Cross-Patch Inconsistency loss and a Cross-Patch Aggregation module then combine the local temporal embeddings into a global prediction, as sketched below.
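The following sketch illustrates this structure under loose assumptions: a shallow 3D convolution stands in for the learned low-level enhancement filters, a small transformer encoder models temporal consistency within each patch sequence, and a mean-pooled classification head stands in for the Cross-Patch Aggregation module. All layer sizes, module names, and pooling choices are hypothetical, and the paper's Cross-Patch Inconsistency loss and global contrastive objective are omitted.

```python
import torch
import torch.nn as nn

class LocalSequenceTransformer(nn.Module):
    """Toy stand-in for the paper's LST: 3D filters enhance low-level cues within
    each patch sequence, then a transformer encoder models temporal consistency.
    Dimensions and depths are illustrative, not the paper's configuration."""
    def __init__(self, in_ch=3, dim=128, num_layers=2, num_heads=4):
        super().__init__()
        self.enhance = nn.Conv3d(in_ch, dim, kernel_size=3, padding=1)  # learned 3D filters
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))                  # keep the time axis
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patch_seq):                 # (B, T, C, p, p)
        x = patch_seq.permute(0, 2, 1, 3, 4)      # (B, C, T, p, p) for Conv3d
        x = self.enhance(x)                       # (B, dim, T, p, p)
        x = self.pool(x).squeeze(-1).squeeze(-1)  # (B, dim, T)
        x = x.permute(0, 2, 1)                    # (B, T, dim)
        return self.temporal(x).mean(dim=1)       # one temporal embedding per patch sequence

class CrossPatchAggregator(nn.Module):
    """Simplified aggregation of per-patch embeddings into a clip-level prediction."""
    def __init__(self, dim=128, num_classes=2):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, patch_embeddings):          # (B, N, dim), N local patch sequences
        clip_embedding = patch_embeddings.mean(dim=1)
        return self.cls_head(clip_embedding)      # real/fake logits
```

In this reading, a clip is first turned into local patch sequences (for example with the extraction sketch above), each sequence is embedded by the LST, and the per-patch embeddings are aggregated for the final real/fake prediction.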
Datasets
UNKNOWN
Model(s)
Local Sequence Transformer (LST), Transformer, 3D convolutional layers
Author countries
China