TALL: Thumbnail Layout for Deepfake Video Detection

Authors: Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He

Published: 2023-07-14 17:27:22+00:00

AI Summary

This paper proposes TALL, a Thumbnail Layout strategy for efficient deepfake video detection. TALL transforms a video clip into a predefined layout that preserves spatial and temporal dependencies, and it is model-agnostic. Integrated with Swin Transformer, the resulting TALL-Swin reaches 90.79% AUC on the cross-dataset task FaceForensics++ to Celeb-DF, which the authors report as state of the art.

Abstract

The growing threats of deepfakes to society and cybersecurity have raised enormous public concerns, and increasing efforts have been devoted to this critical topic of deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. Specifically, consecutive frames are masked in a fixed position in each frame to improve generalization, then resized to sub-images and rearranged into a pre-defined layout as the thumbnail. TALL is model-agnostic and extremely simple by only modifying a few lines of code. Inspired by the success of vision transformers, we incorporate TALL into Swin Transformer, forming an efficient and effective method TALL-Swin. Extensive experiments on intra-dataset and cross-dataset validate the validity and superiority of TALL and SOTA TALL-Swin. TALL-Swin achieves 90.79% AUC on the challenging cross-dataset task, FaceForensics++ → Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake.

Key findings

TALL-Swin achieves a 90.79% AUC on the cross-dataset task FaceForensics++ to Celeb-DF, outperforming existing methods. The approach demonstrates strong generalization across datasets and robustness to common video corruptions.

Approach

TALL rearranges consecutive frames of a video clip into a thumbnail layout, preserving spatio-temporal information. This thumbnail is then fed into a model (like Swin Transformer), creating TALL-Swin, which efficiently detects inconsistencies in deepfake videos.
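
The rearrangement described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function name make_thumbnail, the 2x2 grid, the 112-pixel sub-image size, the nearest-neighbour resize, and the choice of one random mask position shared by all frames in a clip are all assumptions made for illustration.

```python
import numpy as np


def make_thumbnail(clip, grid=(2, 2), sub_size=112, mask_size=0, rng=None):
    """Arrange consecutive frames into a single thumbnail image.

    clip: array of shape (T, H, W, C) with T == grid[0] * grid[1].
    Hypothetical helper; all defaults are illustrative assumptions.
    """
    rows, cols = grid
    t, h, w, c = clip.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    frames = clip.copy()

    # Mask the same square region in every frame of the clip.
    # Assumption: the position is drawn once per clip and filled with zeros.
    if mask_size > 0:
        if rng is None:
            rng = np.random.default_rng()
        top = int(rng.integers(0, h - mask_size + 1))
        left = int(rng.integers(0, w - mask_size + 1))
        frames[:, top:top + mask_size, left:left + mask_size, :] = 0

    # Nearest-neighbour resize of each frame to sub_size x sub_size.
    ys = np.arange(sub_size) * h // sub_size
    xs = np.arange(sub_size) * w // sub_size
    subs = frames[:, ys][:, :, xs]  # shape (T, sub_size, sub_size, C)

    # Place sub-images in row-major order: frame 0 top-left, frame 1 to its right, ...
    subs = subs.reshape(rows, cols, sub_size, sub_size, c)
    thumb = subs.transpose(0, 2, 1, 3, 4).reshape(rows * sub_size, cols * sub_size, c)
    return thumb
```

With these assumed defaults, four 224x224 frames become one 224x224 thumbnail, so an image backbone such as Swin Transformer can consume it without architectural changes, while temporal order is encoded by the position of each sub-image in the grid.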

Datasets

FaceForensics++, Celeb-DF, DFDC, DeeperForensics

Model(s)

Swin Transformer (TALL-Swin), ResNet-50, EfficientNet-B4, ViT-B

Author countries

China
