Detection of GAN-synthesized street videos

Authors: Omran Alamayreh, Mauro Barni

Published: 2021-09-10 16:59:15+00:00

AI Summary

This paper investigates the detection of AI-generated street videos, a largely unexplored area in deepfake detection. It proposes a simple frame-based detector using a CNN architecture that achieves high accuracy on state-of-the-art DeepStreets videos, even under video compression.

Abstract

Research on the detection of AI-generated videos has focused almost exclusively on face videos, usually referred to as deepfakes. Manipulations like face swapping, face reenactment and expression manipulation have been the subject of intense research, with the development of a number of efficient tools to distinguish artificial videos from genuine ones. Much less attention has been paid to the detection of artificial non-facial videos. Yet, new tools for the generation of such videos are being developed at a fast pace and will soon reach the quality level of deepfake videos. The goal of this paper is to investigate the detectability of a new kind of AI-generated videos framing driving street sequences (here referred to as DeepStreets videos), which, by their nature, cannot be analysed with the same tools used for facial deepfakes. Specifically, we present a simple frame-based detector, achieving very good performance on state-of-the-art DeepStreets videos generated by the Vid2vid architecture. Notably, the detector retains very good performance on compressed videos, even when the compression level used during training does not match that used for the test videos.


Key findings
The proposed detector achieved near-perfect accuracy on the DeepStreets dataset, showing robustness to video compression even with mismatched training and testing conditions. However, cross-dataset analysis revealed that performance varied significantly depending on the training and testing datasets used, highlighting the importance of dataset consistency.
Approach
The researchers propose a frame-based video forgery detector built on the XceptionNet CNN architecture. The model classifies each individual frame as real or fake, processing the whole frame without focusing on specific regions; the video-level decision is then obtained by aggregating the per-frame results.
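The frame-then-aggregate pipeline can be sketched as below. This is a minimal illustration, not the authors' code: `classify_frame` is a hypothetical stub standing in for the trained XceptionNet (which would output a per-frame fake probability), and mean-score thresholding is one plausible aggregation rule, since the summary does not specify the exact one used.

```python
def classify_frame(frame):
    # Stand-in for a trained CNN (XceptionNet in the paper) that
    # returns P(fake) for a single frame; here we just echo a
    # precomputed score attached to the frame.
    return frame["score"]

def classify_video(frames, threshold=0.5):
    """Label a video as fake when the mean per-frame P(fake) exceeds
    the threshold (an assumed aggregation rule for illustration)."""
    scores = [classify_frame(f) for f in frames]
    return sum(scores) / len(scores) > threshold

# Example: a short clip with three per-frame fake probabilities.
video = [{"score": s} for s in (0.9, 0.8, 0.2)]
print(classify_video(video))  # mean ~0.63 > 0.5, so the clip is flagged fake
```

Other aggregation choices (majority vote over per-frame labels, or a max-score rule) slot into `classify_video` without changing the frame-level model.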
Datasets
The DeepStreets dataset, comprising three subsets: Cityvid, Citywcvid, and Kittivid. These subsets are generated with the Vid2vid and Wc-vid2vid architectures, fed with semantic segmentation masks from the Cityscapes and KITTI datasets. Videos were also compressed at different quality levels (HQ and LQ).
Model(s)
XceptionNet CNN architecture.
Author countries
Italy