MVAD: A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection

Authors: Mengxue Hu, Yunfeng Diao, Changtao Miao, Jianshu Li, Zhe Li, Joey Tianyi Zhou

Published: 2025-11-29 05:59:38+00:00

Comment: 7 pages, 2 figures

AI Summary

This paper introduces MVAD, the first comprehensive multimodal video-audio dataset designed for detecting AI-generated content. It addresses limitations of existing datasets by offering genuine multimodality with three realistic forgery patterns, high perceptual quality achieved through diverse state-of-the-art generative models, and extensive diversity across visual styles, content categories, and data types.

Abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.


Key findings
MVAD establishes itself as the first comprehensive dataset for general AI-generated multimodal video-audio content detection, featuring over 215,000 samples across diverse forgery types and categories. Comparative evaluations show that MVAD's generated samples achieve significantly superior quality across various metrics (e.g., aesthetic quality, motion smoothness) compared to existing unimodal and deepfake-focused datasets. This dataset aims to bridge a critical gap and accelerate the development of robust multimodal AIGC detection systems.
Approach
The authors created MVAD by collecting raw real video and audio data, then generating forged multimodal content using over twenty state-of-the-art generative models. This process simulates three real-world forgery patterns: Fake Video-Fake Audio, Fake Video-Real Audio, and Real Video-Fake Audio. The generated samples are rigorously evaluated using automated metrics, Large Multimodal Models (LMMs), and human expert verification to ensure high quality and diversity.
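The three forgery patterns map naturally onto a simple labeling scheme for detectors trained on MVAD. The following minimal Python sketch shows one way a data loader might encode the real/fake video-audio combinations, plus a hypothetical quality gate mirroring the paper's three-stage verification (automated metrics, LMM check, human experts). All class, field, and threshold names here are illustrative assumptions, not the authors' released code.

from dataclasses import dataclass
from enum import Enum


class ForgeryPattern(Enum):
    # The three MVAD forgery patterns, plus the fully authentic case.
    REAL_VIDEO_REAL_AUDIO = "RVRA"   # authentic pair (negative class)
    FAKE_VIDEO_FAKE_AUDIO = "FVFA"   # both modalities generated
    FAKE_VIDEO_REAL_AUDIO = "FVRA"   # generated video, authentic audio
    REAL_VIDEO_FAKE_AUDIO = "RVFA"   # authentic video, generated audio


@dataclass
class Sample:
    video_path: str
    audio_path: str
    pattern: ForgeryPattern

    @property
    def is_fake(self) -> bool:
        # Any generated modality makes the pair fake at the sample level.
        return self.pattern is not ForgeryPattern.REAL_VIDEO_REAL_AUDIO

    @property
    def modality_labels(self) -> tuple:
        # Per-modality labels (video_fake, audio_fake) for fine-grained detection.
        video_fake = self.pattern in (ForgeryPattern.FAKE_VIDEO_FAKE_AUDIO,
                                      ForgeryPattern.FAKE_VIDEO_REAL_AUDIO)
        audio_fake = self.pattern in (ForgeryPattern.FAKE_VIDEO_FAKE_AUDIO,
                                      ForgeryPattern.REAL_VIDEO_FAKE_AUDIO)
        return int(video_fake), int(audio_fake)


def passes_quality_gate(auto_scores, lmm_ok, human_ok, thresholds):
    # Hypothetical three-stage filter echoing the paper's pipeline:
    # automated metrics first, then an LMM check, then human verification.
    metrics_ok = all(auto_scores.get(name, 0.0) >= t
                     for name, t in thresholds.items())
    return metrics_ok and lmm_ok and human_ok


if __name__ == "__main__":
    s = Sample("clip_0001.mp4", "clip_0001.wav",
               ForgeryPattern.FAKE_VIDEO_REAL_AUDIO)
    print(s.is_fake, s.modality_labels)  # True (1, 0)
    kept = passes_quality_gate({"aesthetic_quality": 0.72,
                                "motion_smoothness": 0.95},
                               lmm_ok=True, human_ok=True,
                               thresholds={"aesthetic_quality": 0.5,
                                           "motion_smoothness": 0.9})
    print(kept)  # True

Separating the binary authenticity label from the per-modality labels lets the same dataset support both coarse fake/real classification and attribution of which modality was forged, which is what the three forgery patterns enable.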
Datasets
MVAD, UGC-VideoCaptioner, HarmonySet, TalkVid, MSVD, OpenVid-1M, InternVid-10M, MSR-VTT
Model(s)
UNKNOWN
Author countries
China, Singapore