Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu

Published: 2025-03-25 21:47:29+00:00

AI Summary

This research benchmarks 12 multi-modal LLMs against traditional deepfake detection methods on various datasets. The study finds that the best-performing LLMs achieve competitive results, even surpassing traditional methods on out-of-distribution datasets, while others perform poorly. Prompt tuning is employed to enhance performance.

Abstract

Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection (OpenAI O1/4o, Gemini 2 Flash Thinking, DeepSeek Janus, Grok 3, Llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, Claude 3.5/3.7 Sonnet). We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive performance with promising zero-shot generalization, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the remaining LLM families perform very poorly, some worse than random guessing. Furthermore, we find that newer model versions and reasoning capabilities do not improve performance on a niche task such as deepfake detection, whereas model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning into future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.
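
The zero-shot setup described in the abstract reduces to sending each image to a vision LLM with a classification prompt and parsing a binary verdict. Below is a minimal sketch of that protocol, assuming the OpenAI Python SDK and GPT-4o; the prompt wording and model choice are illustrative, not the authors' exact tuned prompts.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the paper's tuned prompts may be worded differently.
PROMPT = (
    "You are a forensic image analyst. Examine this face image for signs of "
    "AI generation or face swapping (blending artifacts, inconsistent "
    "lighting, irregular textures). Answer with exactly one word: REAL or FAKE."
)

def classify_image(path: str, model: str = "gpt-4o") -> str:
    """Return the model's one-word REAL/FAKE verdict for a single image."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().upper()

print(classify_image("sample_frame.jpg"))  # e.g. "FAKE"
```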


Key findings
The top-performing multi-modal LLMs (the OpenAI models) showed competitive performance and good generalization, even outperforming traditional methods on out-of-distribution datasets. Newer model versions and reasoning capabilities did not consistently improve performance. Model size positively correlated with performance in some cases.
Approach
The researchers evaluate the ability of 12 state-of-the-art multi-modal LLMs to detect deepfakes in images, using prompt tuning to improve results. They compare the LLMs' performance against traditional deepfake detection methods on several datasets, including real-world deepfakes, and analyze the models' reasoning pathways to understand their decision-making process.
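
Comparing LLM verdicts with traditional baselines requires reducing the free-text responses to binary labels and scoring them per dataset. A hypothetical scoring helper is sketched below; the `classify_image` function and the real/fake directory layout are assumptions carried over from the sketch above, not the authors' code.

```python
from pathlib import Path
from sklearn.metrics import accuracy_score, f1_score

def evaluate(dataset_dir: str) -> dict:
    """Score one dataset laid out as <dir>/real/*.jpg and <dir>/fake/*.jpg."""
    y_true, y_pred = [], []
    for label_dir, label in (("real", 0), ("fake", 1)):
        for img in Path(dataset_dir, label_dir).glob("*.jpg"):
            verdict = classify_image(str(img))        # "REAL" or "FAKE"
            y_true.append(label)
            y_pred.append(1 if "FAKE" in verdict else 0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "n": len(y_true),
    }

# Hypothetical in-distribution vs. out-of-distribution comparison:
# for ds in ("FFpp", "CDF", "RWDF"):
#     print(ds, evaluate(f"data/{ds}"))
```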
Datasets
CDF (Celeb-DeepFake Dataset), FF+ (FaceForensics++ Dataset), RWDF (Real-World DeepFake Dataset)
Model(s)
OpenAI O1/4o, Gemini 2 Flash Thinking, DeepSeek Janus, Grok 3, Llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, Claude 3.5/3.7 Sonnet
Author countries
USA