BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Authors: Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

Published: 2025-05-19 02:06:43+00:00

AI Summary

This paper introduces GenBuster-200K, a large-scale AI-generated video dataset with 200,000 high-resolution clips, and BusterX, a novel framework for explainable AI-generated video detection using a multimodal large language model (MLLM) and reinforcement learning. BusterX achieves state-of-the-art performance and provides detailed explanations for its decisions.

Abstract

Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the rapid development of video generation models, such as Sora and WanX, there is currently a lack of large-scale, high-quality AI-generated video datasets for forgery detection. In addition, existing detection approaches predominantly treat the task as binary classification, lacking explainability in model decision-making and failing to provide actionable insights or guidance for the public. To address these challenges, we propose GenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, diverse latest generative techniques, and real-world scenes. We further introduce BusterX, a novel AI-generated video detection and explanation framework leveraging a multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationale. To our knowledge, GenBuster-200K is the first large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios. BusterX is the first framework to integrate an MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.


Key findings
BusterX outperforms state-of-the-art methods on multiple datasets, achieving at least 3.5% higher accuracy on GenBuster-200K, 5.5% on the Closed Benchmark, and 12% on FakeAVCeleb. Ablation studies confirm the effectiveness of the reinforcement learning strategy in enhancing both detection accuracy and explanation quality.
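The reinforcement learning strategy referenced in the ablation is typically driven by a rule-based (verifiable) reward that couples answer correctness with a structured reasoning format. The sketch below is a hypothetical illustration only: the <think>/<answer> tags, the 1.0/0.5 weights, and the real/fake vocabulary are assumptions for exposition, not BusterX's documented reward design.

```python
# Toy verifiable-reward sketch for RL post-training of a detector-explainer.
# Tag names, weights, and rules are illustrative assumptions.
import re

def detection_reward(completion: str, label: str) -> float:
    """Score a model completion against the ground-truth label ('real' or 'fake')."""
    score = 0.0
    # Format term: response must be a think block followed by an answer block.
    if re.fullmatch(r"(?s)\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", completion):
        score += 0.5
    # Accuracy term: the verdict inside <answer> must match the ground-truth label.
    m = re.search(r"<answer>\s*(real|fake)\s*</answer>", completion, re.IGNORECASE)
    if m and m.group(1).lower() == label.lower():
        score += 1.0
    return score

# Example: a well-formed, correct completion earns the full 1.5 reward.
demo = "<think>Hands deform between frames; lighting flickers unnaturally.</think><answer>fake</answer>"
assert detection_reward(demo, "fake") == 1.5
```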
Approach
BusterX uses a multimodal large language model (MLLM) to judge video authenticity. Reinforcement learning trains the MLLM to reason step by step, so each real-or-fake classification comes with a human-understandable explanation; a minimal inference sketch follows.
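As a concrete illustration of the inference side, the sketch below prompts the Qwen2.5-VL-7B-Instruct backbone to produce a step-by-step rationale followed by a real/fake verdict, using the standard Hugging Face Transformers interface for Qwen2.5-VL. The prompt wording, the local clip path, and the decoding settings are assumptions for illustration; BusterX's RL-trained weights and exact prompting scheme may differ.

```python
# Minimal inference sketch: ask a Qwen2.5-VL model to judge a video clip and
# explain its verdict. The prompt is a hypothetical stand-in for BusterX's
# actual (RL-trained) prompting scheme.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2.5-VL examples

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4", "fps": 1.0},  # hypothetical local clip
        {"type": "text", "text": (
            "Analyze this video step by step for signs of AI generation "
            "(texture artifacts, temporal inconsistencies, implausible motion), "
            "then answer with exactly one word: real or fake."
        )},
    ],
}]

# Build the chat prompt and pack the sampled video frames into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens and keep only the newly generated rationale + verdict.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```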
Datasets
GenBuster-200K (200,000 high-resolution video clips, including real and AI-generated videos from various sources and generation methods), FakeAVCeleb, Closed Benchmark (a challenging subset of videos generated by commercial models not seen during training)
Model(s)
Qwen2.5-VL-7B-Instruct (MLLM backbone); baselines: 3D ResNet, 3D ResNeXt, ViViT, VideoMAE, DeMamba
Author countries
UK, China