DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models

Authors: Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min

Published: 2025-06-03 15:45:41+00:00

AI Summary

This paper introduces DFBench, a large-scale benchmark for deepfake image detection featuring 540,000 diverse images from 12 state-of-the-art generative models and a bidirectional evaluation protocol. Based on DFBench, the authors propose MoA-DF, a Mixture of Agents for DeepFake detection that combines probabilities from multiple Large Multimodal Models (LMMs) to achieve state-of-the-art performance.

Abstract

With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) latest models, with fake images generated by 12 state-of-the-art generative models, and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.


Key findings
Generative models exhibit increasing realism, posing significant challenges for detection and limiting the generalization of traditional deepfake detection methods. Large Multimodal Models (LMMs) demonstrate strong zero-shot capabilities for deepfake image detection, outperforming many conventional detectors. The proposed MoA-DF, which leverages an ensemble of multiple LMMs, achieves state-of-the-art performance in deepfake image detection.
Approach
The paper introduces DFBench, a large-scale benchmark dataset with 540,000 diverse images (real, AI-edited, AI-generated by 12 SOTA models) and a bidirectional evaluation protocol. For deepfake detection, they propose MoA-DF, a method that aggregates probabilistic outputs from multiple Large Multimodal Models (LMMs) like Qwen2.5, InternVL2.5, and InternVL3 to enhance robustness and accuracy.
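The combined-probability idea behind MoA-DF can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the uniform averaging rule, and the 0.5 decision threshold are assumptions for clarity; in practice each probability would come from an LMM's likelihood of answering "fake" for the image.

```python
from typing import Sequence


def moa_df_decision(fake_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Combine per-model P(fake) estimates from multiple LMM agents.

    Each entry in `fake_probs` is one model's estimated probability that
    the image is fake. Here the estimates are simply averaged (a uniform
    mixture-of-agents rule); the paper's exact combination may differ.
    """
    if not fake_probs:
        raise ValueError("need at least one model probability")
    combined = sum(fake_probs) / len(fake_probs)
    return "fake" if combined >= threshold else "real"


# Example: three agents (e.g. Qwen2.5-VL, InternVL2.5, InternVL3) disagree,
# but the averaged probability (0.9 + 0.7 + 0.2) / 3 = 0.6 crosses the threshold.
print(moa_df_decision([0.9, 0.7, 0.2]))  # fake
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh two uncertain ones, which is one common motivation for probability-level ensembling.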
Datasets
DFBench, LIVE, CSIQ, TID2013, KADID-10k, CLIVE, KonIQ-10k, Flickr8k, EPAIQA-15K
Model(s)
MoA-DF (ensemble of Qwen2.5, InternVL2.5, InternVL3), CNNSpot, AntifakePrompt, Gram-Net, UnivFD, LGrad, LLaVA-OneVision, DeepSeek-VL, LLaVA-1.5, mPLUG-Owl3, Qwen2.5-VL, CogAgent, InternVL2.5, InternVL3, InternLM-XComposer2.5, LLaVA-NeXT, Llama3.2-Vision, Qwen2-VL, Gemini 1.5 Pro, Grok-2 Vision
Author countries
China