MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Authors: Florinel-Alin Croitoru, Vlad Hondru, Marius Popescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah

Published: 2025-05-16 10:42:30+00:00

Comment: 15 pages

AI Summary

This paper introduces MAVOS-DD, the first large-scale open-set benchmark for multilingual audio-video deepfake detection. The dataset comprises over 250 hours of real and fake videos across eight languages, generated by seven distinct deepfake models, and is structured into challenging open-set evaluation scenarios. Experiments reveal that state-of-the-art deepfake detectors suffer significant performance degradation when tested in these open-set conditions, highlighting their current limitations in generalization.

Abstract

We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of the data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.


Key findings
State-of-the-art deepfake detectors, even when fine-tuned, experience significant performance drops in open-set scenarios involving unseen generative models or languages. Pre-trained models performed close to random chance on the MAVOS-DD dataset, while multimodal approaches (AVFF, MRDF) demonstrated a noticeable advantage over unimodal video-only methods (TALL). These results underscore the critical need for more robust and generalizable deepfake detection techniques.
Approach
The authors address the problem of evaluating deepfake detectors' generalization by creating a new, comprehensive benchmark named MAVOS-DD. The benchmark defines open-set evaluation setups that expose models to deepfake generation methods and languages unseen during training, simulating real-world generalization challenges. They then evaluate existing state-of-the-art deepfake detectors under these scenarios.
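The open-set split construction described above can be sketched in a few lines: videos are partitioned so that some generators and languages appear only at test time. This is a minimal illustrative sketch, not the authors' released code; the `Clip` record, the `open_set_split` helper, and the generator/language names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clip:
    path: str
    language: str
    generator: Optional[str]  # None for real (non-generated) videos
    label: str                # "real" or "fake"

def open_set_split(clips, train_languages, train_generators):
    """Partition clips into in-domain (train) and open-set (test) pools.

    A clip is in-domain only if its language was seen during training and,
    for fakes, its generator was seen during training; everything else is
    routed to the open-set test pool.
    """
    train, test = [], []
    for c in clips:
        seen_language = c.language in train_languages
        seen_generator = c.label == "real" or c.generator in train_generators
        (train if seen_language and seen_generator else test).append(c)
    return train, test
```

A usage example with placeholder generator names: with `train_languages={"english"}` and `train_generators={"gen_a"}`, an English fake made by an unseen `gen_b`, or any German clip, lands in the open-set test pool.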
Datasets
MAVOS-DD (primary), FLUX, FFHQ, CelebAMask-HQ (for identity sources in generation). Mentioned for comparison: FaceForensics++, DFDC, DeeperForensics, ForgeryNet, Celeb-DF, WildDeepfake, FakeAVCeleb, DeepSpeak, Deepfake-Eval-2024.
Model(s)
AVFF, MRDF, TALL (specifically TALL-Swin).
Author countries
Romania, United Arab Emirates, Sweden, United States