Beyond Deepfake Images: Detecting AI-Generated Videos

Authors: Danial Samadi Vahdati, Tai D. Nguyen, Aref Azizpour, Matthew C. Stamm

Published: 2024-04-24 16:19:31+00:00

AI Summary

This paper demonstrates that existing synthetic image detectors fail to detect AI-generated videos due to substantially different forensic traces left by video generators. The authors show that these video traces can be learned and used for reliable video detection and source attribution, even after compression, and that few-shot learning enables accurate detection of videos from new generators.

Abstract

Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.


Key findings
Synthetic image detectors perform poorly on synthetic videos. In contrast, CNNs can be trained to reliably detect synthetic videos and attribute them to their source generator, even after H.264 re-compression, and few-shot learning enables accurate adaptation to videos from new generators.
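
To make the detection setup concrete, the sketch below trains a frame-level binary classifier in PyTorch on frames extracted from real and synthetic videos. This is an illustrative reconstruction, not the authors' released code: the directory layout, ResNet-50 backbone, center-crop preprocessing, and hyperparameters are all assumptions.

```python
# Minimal sketch of a frame-level synthetic-video detector (illustrative only).
# Assumes frames have already been extracted into frames/train/real/ and
# frames/train/synthetic/; backbone and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.CenterCrop(224),   # fixed-size crops preserve pixel-level traces
    transforms.ToTensor(),
])

# ImageFolder expects one subfolder per class (real/, synthetic/)
train_set = datasets.ImageFolder("frames/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

model = models.resnet50(weights=None)            # any CNN backbone could be substituted
model.fc = nn.Linear(model.fc.in_features, 2)    # real vs. synthetic
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for frames, labels in train_loader:
        frames, labels = frames.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
```

At test time, per-frame scores would typically be aggregated (e.g., averaged) across a clip to produce a video-level decision.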
Approach
The researchers trained convolutional neural networks (CNNs) on a new dataset of real and synthetic videos from several generators to learn the distinct forensic traces that synthetic videos carry. They evaluated the models on synthetic video detection and source attribution, assessed the impact of H.264 re-compression, and explored both zero-shot and few-shot transfer learning to generators unseen during training.
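
For the few-shot transfer scenario, one plausible implementation is to start from the detector sketched above, freeze its feature extractor, and fine-tune only the classification head on a small labeled set of frames from the new generator. The sketch below illustrates this; the function name, directory layout, and choice of trainable layers are assumptions rather than the paper's exact protocol (the authors may fine-tune more of the network).

```python
# Illustrative few-shot adaptation of a pretrained frame-level detector to a new
# video generator (not the paper's exact protocol). Assumes a torchvision-style
# model with a final `fc` layer and a small ImageFolder of labeled frames.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def few_shot_adapt(model, few_shot_dir, epochs=20, lr=1e-4, device="cpu"):
    """Fine-tune only the classification head on a handful of labeled frames."""
    for p in model.parameters():        # freeze the feature extractor
        p.requires_grad = False
    for p in model.fc.parameters():     # keep the final layer trainable
        p.requires_grad = True

    tfm = transforms.Compose([transforms.CenterCrop(224), transforms.ToTensor()])
    loader = DataLoader(datasets.ImageFolder(few_shot_dir, transform=tfm),
                        batch_size=16, shuffle=True)

    opt = torch.optim.Adam(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for frames, labels in loader:
            frames, labels = frames.to(device), labels.to(device)
            opt.zero_grad()
            loss_fn(model(frames), labels).backward()
            opt.step()
    return model

# Usage (hypothetical paths):
# model = few_shot_adapt(pretrained_model, "frames/new_generator_fewshot", device="cuda")
```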
Datasets
Moments in Time (MiT), Video-ACID, and a new publicly available dataset of synthetic videos from Luma, VideoCrafter-v1, CogVideo, and Stable Video Diffusion. An out-of-distribution test set included videos from Sora, Pika, and VideoCrafter-v2.
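
Since robustness to H.264 re-compression is part of the evaluation, a re-compressed copy of each test video is needed. A minimal way to produce such copies with the ffmpeg command-line tool is sketched below; the directory names and the CRF quality setting are illustrative assumptions, not the paper's exact encoding parameters.

```python
# Sketch of producing H.264 re-compressed copies of evaluation videos.
# Assumes the ffmpeg CLI is installed; paths and CRF value are illustrative.
import subprocess
from pathlib import Path

def recompress_h264(src: Path, dst: Path, crf: int = 23) -> None:
    """Re-encode a video with libx264 at the given constant rate factor."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-c:v", "libx264",
         "-crf", str(crf), "-an", str(dst)],
        check=True,
    )

if __name__ == "__main__":
    out_dir = Path("videos_recompressed")
    out_dir.mkdir(exist_ok=True)
    for video in Path("videos_original").glob("*.mp4"):
        recompress_h264(video, out_dir / video.name)
```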
Model(s)
ResNet-50, ResNet-34, VGG-16, Xception, DenseNet, MISLnet, DIF, and Swin-Transformer. The MISLnet architecture performed best.
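
Comparing these architectures requires giving each backbone a two-class (real vs. synthetic) output head. The sketch below shows how the torchvision backbones in the list above could be instantiated for that purpose; the DenseNet-121 variant and the from-scratch weight initialization are assumptions, and MISLnet, DIF, and Xception are not available in torchvision, so they would need separate implementations.

```python
# Sketch of building several of the evaluated backbones as two-class detectors
# using torchvision (illustrative; MISLnet, DIF, and Xception not covered here).
import torch.nn as nn
from torchvision import models

def make_detector(name: str, num_classes: int = 2) -> nn.Module:
    if name == "resnet50":
        m = models.resnet50(weights=None)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "resnet34":
        m = models.resnet34(weights=None)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "vgg16":
        m = models.vgg16(weights=None)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    elif name == "densenet121":
        m = models.densenet121(weights=None)
        m.classifier = nn.Linear(m.classifier.in_features, num_classes)
    elif name == "swin_t":
        m = models.swin_t(weights=None)
        m.head = nn.Linear(m.head.in_features, num_classes)
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return m
```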
Author countries
USA