Beyond Real versus Fake: Towards Intent-Aware Video Analysis

Authors: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva

Published: 2025-11-27 13:44:06+00:00

AI Summary

This paper introduces IntentHQ, a novel benchmark dataset for human-centered intent analysis in videos, moving beyond traditional deepfake detection to contextual understanding. It comprises 5168 videos annotated with 23 fine-grained intent categories. The authors propose a self-supervised multi-modality approach that integrates spatio-temporal video features, audio processing, and text analysis to recognize intent.

Abstract

The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including Financial fraud, Indirect marketing, Political propaganda, as well as Fear mongering. We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.


Key findings
The proposed self-supervised pretraining with fine-tuning achieved the best accuracy of 52.5% across the 23 intent classes, significantly outperforming supervised-only baselines and standard video classification models. Ablation studies revealed that the video and text modalities are the most impactful, while audio contributes less in isolation. Filtering for English-only videos with clear audio improved model performance.
Approach
The proposed methodology involves a self-supervised framework for three-way contrastive alignment of video, audio, and text features, followed by supervised fine-tuning. It utilizes modality-specific encoders (CLIP ViT-L/14 for video, WavLM for audio, CLIP Text Encoder for text) and a lightweight MLP classifier to predict one of 23 intent categories.
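A minimal sketch of this pipeline, not the authors' code, is shown below: three-way contrastive alignment of video, audio, and text embeddings from frozen modality-specific encoders, followed by a lightweight MLP classifier over the 23 intent categories. Projection dimensions, temperature, dropout, and loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalised embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class TriModalAligner(nn.Module):
    """Projects frozen encoder features into a shared space for contrastive pretraining,
    then classifies the fused representation during supervised fine-tuning."""
    def __init__(self, video_dim=768, audio_dim=768, text_dim=768, shared_dim=512, n_classes=23):
        super().__init__()
        self.proj_v = nn.Linear(video_dim, shared_dim)
        self.proj_a = nn.Linear(audio_dim, shared_dim)
        self.proj_t = nn.Linear(text_dim, shared_dim)
        # Lightweight MLP head used for the 23-way intent prediction.
        self.classifier = nn.Sequential(
            nn.Linear(3 * shared_dim, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, n_classes),
        )

    def contrastive_loss(self, v_feat, a_feat, t_feat):
        v, a, t = self.proj_v(v_feat), self.proj_a(a_feat), self.proj_t(t_feat)
        # Three pairwise alignment terms: video-text, video-audio, audio-text.
        return info_nce(v, t) + info_nce(v, a) + info_nce(a, t)

    def classify(self, v_feat, a_feat, t_feat):
        fused = torch.cat([self.proj_v(v_feat), self.proj_a(a_feat), self.proj_t(t_feat)], dim=-1)
        return self.classifier(fused)
```

Usage would follow the two-stage recipe described above: first minimise contrastive_loss on unlabeled (video, audio, transcript) triplets, then fine-tune classify() with cross-entropy against the 23 intent labels.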
Datasets
IntentHQ (5168 videos, 23 intent categories, multimodal: video, audio, transcript)
Model(s)
CLIP ViT-L/14 (video encoder), WavLM (audio encoder), CLIP Text Encoder (text encoder), Meta Llama 3 8B (for text augmentation), MLP classifier. (Other models, including VideoLLaMA3, VifiCLIP, ONE-PEACE, SigLIP, ViT-B/32, HuBERT, Qwen, VideoMAEv2, Data2Vec, and BERT-Large, were used as baselines.)
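For illustration, the modality-specific encoders named above could be loaded via Hugging Face transformers as sketched below. The checkpoint names and the mean-pooling of frame and audio features are assumptions; the summary does not specify exact checkpoints or pooling.

```python
import torch
from transformers import CLIPVisionModel, CLIPTextModel, CLIPTokenizer, WavLMModel

# Assumed checkpoints for the three encoders.
video_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
audio_encoder = WavLMModel.from_pretrained("microsoft/wavlm-large")

@torch.no_grad()
def encode_video(frames):  # frames: (T, 3, 224, 224) preprocessed frame tensor
    out = video_encoder(pixel_values=frames).pooler_output   # (T, hidden) per-frame features
    return out.mean(dim=0)                                    # temporal mean-pool to one vector

@torch.no_grad()
def encode_text(transcript: str):
    tokens = tokenizer(transcript, truncation=True, return_tensors="pt")
    return text_encoder(**tokens).pooler_output.squeeze(0)

@torch.no_grad()
def encode_audio(waveform):  # waveform: (1, num_samples) at 16 kHz
    out = audio_encoder(waveform).last_hidden_state           # (1, T', hidden) frame features
    return out.mean(dim=1).squeeze(0)                         # temporal mean-pool
```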
Author countries
India, France, Germany