Human Action CLIPs: Detecting AI-generated Human Motion

Authors: Matyas Bohacek, Hany Farid

Published: 2024-11-30 16:20:58+00:00

AI Summary

This paper introduces a robust technique for distinguishing real from AI-generated human motion in videos using multi-modal semantic embeddings, specifically CLIP embeddings. The method is resilient to common laundering operations, including resolution and compression attacks, and generalizes well to AI models not seen during training.

Abstract

AI-generated video generation continues its journey through the uncanny valley to produce content that is increasingly perceptually indistinguishable from reality. To better protect individuals, organizations, and societies from its malicious applications, we describe an effective and robust technique for distinguishing real from AI-generated human motion using multi-modal semantic embeddings. Our method is robust to the types of laundering that typically confound more low- to mid-level approaches, including resolution and compression attacks. This method is evaluated against DeepAction, a custom-built, open-sourced dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage. The dataset is available under an academic license at https://www.huggingface.co/datasets/faridlab/deepaction_v1.


Key findings

The fine-tuned CLIP model paired with an RBF-kernel SVM achieved the highest accuracy (99.1%) in distinguishing real from AI-generated videos. The unsupervised "frame-to-prompt" method, which requires no training, reached a surprisingly high 96.2% accuracy. The approach was also robust to resolution and compression attacks and generalized well to AI models not seen during training.

Approach

The authors use several pre-trained CLIP-style models and a fine-tuned CLIP model to extract multi-modal embeddings from video frames. These embeddings are then classified either with support vector machines (SVMs) or with an unsupervised cosine-similarity approach that compares each frame embedding to the text prompts "real image" and "fake image".
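A minimal sketch of the two classification strategies is given below, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face as a stand-in; the authors' fine-tuned FT-CLIP weights, exact prompts, and SVM hyperparameters are not reproduced here.

```python
# Sketch of (a) supervised SVM classification of CLIP frame embeddings and
# (b) the unsupervised "frame-to-prompt" comparison, using a stock CLIP checkpoint.
import numpy as np
import torch
from PIL import Image
from sklearn.svm import SVC
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def frame_embedding(frame: Image.Image) -> np.ndarray:
    """Return an L2-normalized CLIP image embedding for a single video frame."""
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.squeeze(0).numpy()


def fit_svm(embeddings: np.ndarray, labels: np.ndarray) -> SVC:
    """(a) Supervised: fit an RBF-kernel SVM on labeled frame embeddings.
    embeddings: (n_frames, dim); labels: 1 = AI-generated, 0 = real."""
    clf = SVC(kernel="rbf")
    clf.fit(embeddings, labels)
    return clf


def frame_to_prompt(frame: Image.Image) -> str:
    """(b) Unsupervised "frame-to-prompt": compare a frame embedding to the text
    embeddings of "real image" vs. "fake image" and return the closer prompt."""
    prompts = ["real image", "fake image"]
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feats = model.get_text_features(**text_inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = text_feats.numpy() @ frame_embedding(frame)  # cosine similarities
    return prompts[int(np.argmax(sims))]
```
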
Datasets

DeepAction: a custom-built dataset of video clips depicting human actions, generated by seven text-to-video AI models (BD AnimateDiff, CogVideoX-5B, Lumiere, RunwayML Gen3, Stable Diffusion Txt2Img+Img2Vid, Veo, and VideoPoet), paired with matching real footage from Pexels.
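A hypothetical sketch of downloading the dataset from the Hugging Face Hub follows; because DeepAction is released under an academic license, access must be requested and an authentication token configured first, and the local file layout is an assumption rather than something documented here.

```python
# Hypothetical download sketch: assumes access to the gated faridlab/deepaction_v1
# repository has already been granted and `huggingface-cli login` has been run.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="faridlab/deepaction_v1", repo_type="dataset")
print(f"DeepAction clips downloaded to: {local_dir}")
```
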
Model(s)

CLIP, SigLIP, JinaCLIP, and a fine-tuned CLIP (FT-CLIP) model; support vector machines (SVMs) with linear, RBF, and polynomial kernels.

Author countries

United States