MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Authors: Florinel-Alin Croitoru, Vlad Hondru, Marius Popescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah

Published: 2025-05-16 10:42:30+00:00

Comment: 15 pages

AI Summary

This paper introduces MAVOS-DD, the first large-scale open-set benchmark for multilingual audio-video deepfake detection. The dataset comprises over 250 hours of real and fake videos across eight languages, generated by seven distinct deepfake models, and is structured into challenging open-set evaluation scenarios. Experiments reveal that state-of-the-art deepfake detectors suffer significant performance degradation when tested in these open-set conditions, highlighting their current limitations in generalization.

Abstract

We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of the data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.


Key findings
State-of-the-art deepfake detectors, even when fine-tuned, experience significant performance drops in open-set scenarios involving unseen generative models or languages. Pre-trained models performed close to random chance on the MAVOS-DD dataset, while multimodal approaches (AVFF, MRDF) demonstrated a noticeable advantage over unimodal video-only methods (TALL). These results underscore the critical need for more robust and generalizable deepfake detection techniques.
Approach
The authors address the problem of evaluating deepfake detectors' generalization by creating a new, comprehensive benchmark named MAVOS-DD. The benchmark defines open-set evaluation setups that expose models to deepfake generation methods and languages unseen during training, simulating real-world generalization challenges. They then evaluate existing state-of-the-art deepfake detectors under these scenarios.
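The open-set split construction described above can be sketched in a few lines: videos are partitioned so that some generators and languages appear only at test time. This is a minimal illustrative sketch, not the authors' released code; the `Clip` record, the `open_set_split` helper, and the generator/language names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clip:
    path: str
    language: str
    generator: Optional[str]  # None for real (non-generated) videos
    label: str                # "real" or "fake"

def open_set_split(clips, train_languages, train_generators):
    """Partition clips into in-domain (train) and open-set (test) pools.

    A clip is in-domain only if its language was seen during training and,
    for fakes, its generator was seen during training; everything else is
    routed to the open-set test pool.
    """
    train, test = [], []
    for c in clips:
        seen_language = c.language in train_languages
        seen_generator = c.label == "real" or c.generator in train_generators
        (train if seen_language and seen_generator else test).append(c)
    return train, test
```

A usage example with placeholder generator names: with `train_languages={"english"}` and `train_generators={"gen_a"}`, an English fake made by an unseen `gen_b`, or any German clip, lands in the open-set test pool.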
Datasets
MAVOS-DD (primary), FLUX, FFHQ, CelebAMask-HQ (for identity sources in generation). Mentioned for comparison: FaceForensics++, DFDC, DeeperForensics, ForgeryNet, Celeb-DF, WildDeepfake, FakeAVCeleb, DeepSpeak, Deepfake-Eval-2024.
Model(s)
AVFF, MRDF, TALL (specifically TALL-Swin).
Author countries
Romania, United Arab Emirates, Sweden, United States