Undercover Deepfakes: Detecting Fake Segments in Videos

Authors: Sanjay Saha, Rashindrie Perera, Sachith Seneviratne, Tamasha Malepathirana, Sanka Rasnayaka, Deshani Geethika, Terence Sim, Saman Halgamuge

Published: 2023-05-11 04:43:10+00:00

AI Summary

This paper introduces a deepfake detection method that makes predictions at both the frame and video levels, addressing the under-explored problem of detecting subtly altered video segments. The approach uses a Vision Transformer for spatial features and a Timeseries Transformer for temporal features, achieving strong results on a benchmark dataset the authors created for this task.

Abstract

The recent renaissance in generative models, driven primarily by the advent of diffusion models and iterative improvements in GAN methods, has enabled many creative applications. However, each advancement is also accompanied by a rise in the potential for misuse. In the arena of deepfake generation, this is a key societal issue. In particular, the ability to modify segments of videos using such generative techniques creates a new paradigm of deepfakes: mostly real videos altered slightly to distort the truth. This paradigm has been under-explored by current deepfake detection methods in the academic literature. In this paper, we present a deepfake detection method that addresses this issue by performing deepfake prediction at both the frame and video levels. To facilitate testing our method, we prepared a new benchmark dataset in which videos contain both real and fake frame sequences, with very subtle transitions between them. We provide a benchmark on the proposed dataset with our detection method, which utilizes a Vision Transformer adapted via Scaling and Shifting to learn spatial features, and a Timeseries Transformer to learn temporal features, helping to interpret possible deepfakes. Extensive experiments on a variety of deepfake generation methods show that the proposed method achieves excellent results on temporal segmentation as well as classical video-level prediction. In particular, the paradigm we address will form a powerful tool for the moderation of deepfakes, where human oversight can be better targeted to the parts of videos suspected of being deepfakes. All experiments can be reproduced at: github.com/rgb91/temporal-deepfake-segmentation.


Key findings
The proposed method outperforms state-of-the-art methods in temporal deepfake segmentation, achieving high accuracy even with short fake segments. It also performs competitively in traditional video-level deepfake detection and generalizes well to unseen datasets.
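
The exact evaluation protocol is detailed in the paper; as a rough illustration of how frame-level predictions can be scored against ground-truth fake segments, the sketch below computes frame accuracy and the intersection-over-union of the fake frames (the function names and the choice of IoU are illustrative assumptions, not the authors' exact metrics).

```python
import numpy as np

def frame_accuracy(pred: np.ndarray, truth: np.ndarray) -> float:
    """Fraction of frames whose real/fake label is predicted correctly."""
    return float((pred == truth).mean())

def fake_segment_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union of predicted and true fake frames.
    Both inputs are binary arrays (1 = fake frame, 0 = real frame)."""
    inter = np.logical_and(pred == 1, truth == 1).sum()
    union = np.logical_or(pred == 1, truth == 1).sum()
    return float(inter / union) if union else 1.0

# Example: a 10-frame video with a short fake segment in the middle.
truth = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
pred  = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
print(frame_accuracy(pred, truth), fake_segment_iou(pred, truth))  # 0.9 0.75
```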
Approach
The authors propose a two-stage method. First, a Vision Transformer (ViT) extracts frame-level features, which are then processed by a Timeseries Transformer (TsT) to learn temporal features for classification. A smoothing technique based on majority voting improves the accuracy of frame-level predictions.
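
A minimal PyTorch sketch of this two-stage design follows. It is not the authors' implementation: a stock torchvision ViT-B/16 stands in for the SSF-adapted backbone, and the TsT depth, head count, and smoothing window size are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class TwoStageDetector(nn.Module):
    """Stage 1: per-frame ViT features. Stage 2: Timeseries Transformer
    over the frame sequence, producing a real/fake logit per frame."""
    def __init__(self, d_model: int = 768, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.vit = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()  # keep the 768-d embedding per frame
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.tst = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 2)  # real vs. fake, per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1)).view(b, t, -1)  # (b, t, 768)
        return self.head(self.tst(feats))                      # (b, t, 2)

def majority_smooth(labels: torch.Tensor, window: int = 5) -> torch.Tensor:
    """Majority-vote smoothing of per-frame labels for one video.
    labels: (t,) binary predictions; the window size is an assumption."""
    pad = window // 2
    padded = torch.cat([labels[:1].repeat(pad), labels, labels[-1:].repeat(pad)])
    votes = padded.unfold(0, window, 1).float().mean(-1)  # fraction voting fake
    return (votes > 0.5).long()
```

A video-level prediction can then be derived by aggregating the smoothed per-frame labels, for example flagging a video as fake if any sufficiently long run of frames is classified as fake.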
Datasets
A new benchmark dataset created by the authors, based on FaceForensics++, containing videos that mix real and fake frame sequences joined by very subtle transitions; also FaceForensics++, CelebDF, DFDC, and WildDeepFakes.
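
To give a concrete picture of how such a mixed real/fake video could be assembled: given temporally aligned real and deepfaked frame sequences of the same video, a fake segment is spliced into the real sequence and per-frame labels are recorded. The helper below is hypothetical; the authors' pipeline additionally keeps the transitions between real and fake segments visually subtle.

```python
import numpy as np

def splice_fake_segment(real_frames: np.ndarray, fake_frames: np.ndarray,
                        start: int, length: int):
    """Replace frames [start, start+length) of a real video with the
    corresponding frames of its deepfaked counterpart, returning the
    mixed video and per-frame ground-truth labels (1 = fake).
    Hypothetical helper, not the paper's actual dataset code."""
    assert real_frames.shape == fake_frames.shape
    mixed = real_frames.copy()
    mixed[start:start + length] = fake_frames[start:start + length]
    labels = np.zeros(len(mixed), dtype=np.int64)
    labels[start:start + length] = 1
    return mixed, labels
```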
Model(s)
Vision Transformer (ViT-B/16) with Scaling and Shifting (SSF) for spatial feature extraction, Timeseries Transformer (TsT) for temporal feature learning.
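
SSF is a parameter-efficient fine-tuning technique: the pre-trained ViT weights stay frozen, and only lightweight per-channel scale and shift parameters inserted after the network's operations are trained. A minimal sketch of such a module (its exact placement inside the ViT and the naming are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SSF(nn.Module):
    """Scaling-and-Shifting adapter: a learnable per-channel scale and
    shift applied to frozen backbone features, so only these small
    tensors are updated during fine-tuning. A sketch of the general
    idea, not the exact configuration used in the paper."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); broadcasts over all leading dimensions
        return x * self.scale + self.shift
```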
Author countries
Singapore, Australia