Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

Authors: Qingyuan Liu, Pengyuan Shi, Yun-Yun Tsai, Chengzhi Mao, Junfeng Yang

Published: 2024-06-13 21:52:49+00:00

AI Summary

This paper introduces DIVID, a novel framework for detecting videos synthesized by state-of-the-art generative models like Stable Video Diffusion. It addresses the limitation of existing detectors that struggle with temporal features in videos by using a CNN+LSTM architecture trained on both RGB frames and Diffusion Reconstruction Error (DIRE) values.

Abstract

The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent work on combating Deepfake videos has produced detectors that are highly accurate at identifying GAN-generated samples. However, the robustness of these detectors on diffusion-generated videos from video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and Pika) is still unexplored. In this paper, we propose a novel framework for detecting videos synthesized by multiple state-of-the-art (SOTA) generative models, such as Stable Video Diffusion. We find that the SOTA methods for detecting diffusion-generated images lack robustness in identifying diffusion-generated videos. Our analysis reveals that the effectiveness of these detectors diminishes when applied to out-of-domain videos, primarily because they struggle to track the temporal features and dynamic variations between frames. To address this challenge, we collect a new benchmark video dataset of diffusion-generated videos using SOTA video creation tools. We extract representations that carry explicit knowledge from the diffusion model for video frames and train our detector with a CNN + LSTM architecture. The evaluation shows that our framework captures the temporal features between frames well, achieves 93.7% detection accuracy on in-domain videos, and improves the accuracy on out-of-domain videos by up to 16 points.

Key findings

DIVID achieves 93.7% detection accuracy on in-domain videos and improves out-of-domain accuracy by up to 16 points compared to baselines. The use of both RGB frames and DIRE values with a CNN+LSTM architecture significantly improves the robustness and generalizability of the detector.

Approach

DIVID calculates the Diffusion Reconstruction Error (DIRE) for each video frame using a pre-trained diffusion model. A CNN+LSTM architecture is then trained on both the DIRE values and the original RGB frames to capture both spatial and temporal information, improving detection accuracy.
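
As a rough illustration of the per-frame DIRE step, the sketch below treats DIRE as the per-pixel absolute difference between a frame and its reconstruction, which is how the original image-level DIRE method defines it. Everything named here is an assumption for illustration: `toy_reconstruct` is a hypothetical placeholder for the pre-trained diffusion model's inversion and reconstruction pass, frames are plain grayscale grids, and `mean_error` is just a convenience summary, not something the paper is stated to use.

```python
from typing import Callable, List

# One grayscale frame as rows of pixel intensities in [0, 1] (illustrative type).
Frame = List[List[float]]


def dire_map(frame: Frame, reconstruct: Callable[[Frame], Frame]) -> Frame:
    """Per-pixel absolute error between a frame and its reconstruction."""
    recon = reconstruct(frame)
    return [
        [abs(p - q) for p, q in zip(row, recon_row)]
        for row, recon_row in zip(frame, recon)
    ]


def video_dire(frames: List[Frame], reconstruct: Callable[[Frame], Frame]) -> List[Frame]:
    """Compute one DIRE map per frame, keeping the temporal order of the video."""
    return [dire_map(frame, reconstruct) for frame in frames]


def mean_error(dire: Frame) -> float:
    """Average reconstruction error of a single DIRE map."""
    values = [v for row in dire for v in row]
    return sum(values) / len(values)


def toy_reconstruct(frame: Frame) -> Frame:
    """Hypothetical stand-in for the diffusion model: pulls each pixel halfway toward 0.5.

    A real pipeline would invert the frame to noise and denoise it back
    with a pre-trained diffusion model instead.
    """
    return [[0.5 * p + 0.25 for p in row] for row in frame]


if __name__ == "__main__":
    video = [
        [[0.0, 1.0], [0.5, 0.5]],   # high-contrast frame
        [[0.5, 0.5], [0.5, 0.5]],   # flat frame, reconstructed exactly by the toy model
    ]
    for index, dire in enumerate(video_dire(video, toy_reconstruct)):
        print(index, mean_error(dire))
```

In a detector along the lines described above, the sequence of DIRE maps would be fed, together with the RGB frames, to the CNN+LSTM rather than reduced to a single number per frame.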

Datasets

The authors collect a new benchmark video dataset containing in-domain videos generated with Stable Video Diffusion and out-of-domain videos from Pika, Runway Gen-2, and SORA. Real videos are sourced from ImageNet Video Visual Relation Detection (VidVRD) and YouTube.

Model(s)

ResNet50 (CNN) + LSTM

Author countries

USA
