Human Action CLIPs: Detecting AI-generated Human Motion

Authors: Matyas Bohacek, Hany Farid

Published: 2024-11-30 16:20:58+00:00

Journal Ref: Workshop on Deepfake Detection, Localization and Interpretability @ IJCAI 2025

AI Summary

This paper presents an effective and robust technique for detecting AI-generated human motion in videos, leveraging multi-modal semantic embeddings. The method demonstrates strong performance against seven text-to-video AI models and is resilient to common laundering attacks such as resolution reduction and recompression. A new open-sourced dataset, DeepAction, is introduced for evaluating AI-generated human motion detection.

Abstract

AI-powered video generation continues its journey through the uncanny valley to produce content that is increasingly perceptually indistinguishable from reality. To better protect individuals, organizations, and societies from its malicious applications, we describe an effective and robust technique for distinguishing real from AI-generated human motion using multi-modal semantic embeddings. Our method is robust to the types of laundering that typically confound low- to mid-level forensic approaches, including resolution and compression attacks. This method is evaluated against DeepAction, a custom-built, open-sourced dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage. The dataset is available under an academic license at https://www.huggingface.co/datasets/faridlab/deepaction_v1.


Key findings
The fine-tuned CLIP (FT-CLIP) embedding with an RBF kernel SVM achieved the highest two-class video-level accuracy of 99.1-99.2%. The method proved robust against resolution and compression laundering, maintaining high accuracy. It also exhibited good generalizability to previously unseen synthesis models and non-human motion, with the unsupervised frame-to-prompt approach showing surprisingly strong performance without explicit training.
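The unsupervised frame-to-prompt approach mentioned above can be sketched as follows: a frame is flagged as AI-generated when its embedding is unusually similar to the embedding of the text prompt. This is a minimal stdlib-only sketch; the toy vectors, the `classify_frame_to_prompt` helper, and the 0.5 threshold are illustrative stand-ins, not values from the paper (which uses real CLIP image/text embeddings).

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify_frame_to_prompt(frame_emb, prompt_emb, threshold=0.5):
    # Label a frame AI-generated when it is unusually close to the
    # generation prompt's text embedding. The threshold is a stand-in.
    sim = cosine_similarity(frame_emb, prompt_emb)
    return "ai-generated" if sim >= threshold else "real"

# Toy vectors standing in for CLIP image/text embeddings.
prompt = [1.0, 0.0, 0.0]
generated_frame = [0.9, 0.1, 0.0]   # nearly aligned with the prompt
real_frame = [0.2, 0.7, 0.7]        # weak alignment with the prompt
print(classify_frame_to_prompt(generated_frame, prompt))  # ai-generated
print(classify_frame_to_prompt(real_frame, prompt))       # real
```

The appeal of this scheme, as the findings note, is that it needs no labeled training data: the prompt itself supplies the decision signal.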
Approach
The authors employ multi-modal semantic embeddings from models like CLIP, SigLIP, JinaCLIP, and a custom fine-tuned CLIP (FT-CLIP) to represent video frames. These embeddings are then used with various classification schemes, including supervised Support Vector Machines (SVMs) (two-class and multi-class) and an unsupervised frame-to-prompt cosine similarity approach, to distinguish real from AI-generated videos.
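The supervised pipeline can be sketched as: embed each frame, score each embedding with a trained classifier, then aggregate frame-level decisions into a video-level label. This stdlib-only sketch substitutes a simple linear scorer for the paper's RBF-kernel SVM and uses majority voting as one plausible aggregation rule; the embeddings, weights, and helper names are hypothetical.

```python
from collections import Counter

def frame_label(embedding, weights, bias=0.0):
    # Stand-in per-frame classifier: a linear score in place of the
    # paper's RBF-kernel SVM over CLIP-style frame embeddings.
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    return "ai-generated" if score > 0 else "real"

def classify_video(frame_embeddings, weights):
    # Video-level decision by majority vote over per-frame labels,
    # one plausible way to aggregate frame predictions.
    votes = Counter(frame_label(e, weights) for e in frame_embeddings)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D embeddings; a positive first coordinate loosely
# signals "AI-generated" under these made-up weights.
weights = [1.0, -0.2]
video = [[0.8, 0.1], [0.6, -0.3], [-0.1, 0.5]]  # two of three frames score positive
print(classify_video(video, weights))  # ai-generated
```

In the actual method, `frame_label` would be an SVM trained on embeddings from CLIP, SigLIP, JinaCLIP, or FT-CLIP, but the embed-score-aggregate structure is the same.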
Datasets
DeepAction (custom-built, open-sourced, containing 3,100 AI-generated video clips from seven text-to-video AI models and 100 real videos from Pexels), DeepSpeak Dataset (for talking heads generalization), GTA-Human dataset (for CGI generalization).
Model(s)
CLIP (ViT-B/32), SigLIP (patch16-224), JinaCLIP (v1), custom Fine-Tuned CLIP (FT-CLIP), Support Vector Machines (SVMs) with linear, RBF, and polynomial kernels.
Author countries
USA