LLMs Are Not Yet Ready for Deepfake Image Detection

Authors: Shahroz Tariq, David Nguyen, M. A. P. Chamikara, Tingmin Wu, Alsharif Abuadbba, Kristen Moore

Published: 2025-06-12 08:27:24+00:00

AI Summary

This research evaluates the zero-shot deepfake image detection capabilities of four prominent vision-language models (VLMs): ChatGPT, Claude, Gemini, and Grok. The study finds that while VLMs can offer interpretable explanations and detect surface-level anomalies, they are not yet reliable enough for standalone deepfake detection due to significant limitations in accuracy and biases.

Abstract

The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model's classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.


Key findings
VLMs showed inconsistent performance across deepfake types and image styles, and were prone to misclassification driven by biases toward specific aesthetic features and surface-level reasoning. Although not accurate enough for autonomous detection, they offer valuable interpretability and could augment human expertise in deepfake detection workflows.
Approach
The researchers conducted a zero-shot evaluation of four VLMs on a benchmark dataset of real and manipulated images spanning faceswap, reenactment, and fully synthetic deepfakes. They assessed each model's classification accuracy and reasoning depth, and identified failure modes such as overemphasis on stylistic elements.
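A zero-shot setup like the one described amounts to sending each image to a VLM with a fixed instruction and mapping the free-text reply back to a binary label. The sketch below illustrates that pattern; the prompt wording and the `parse_verdict` helper are hypothetical illustrations, not the authors' actual protocol, and the keyword mapping is a deliberately simple assumption.

```python
# Hypothetical zero-shot prompt; the paper's exact wording is not reproduced here.
ZERO_SHOT_PROMPT = (
    "You are a media forensics assistant. Examine the attached face image and "
    "decide whether it is REAL or FAKE (faceswap, reenactment, or fully "
    "synthetic). Answer with one word, REAL or FAKE, then briefly explain "
    "the visual evidence."
)

def parse_verdict(reply: str) -> str:
    """Map a model's free-text reply to a coarse label.

    Returns 'fake', 'real', or 'unknown' when no keyword is found.
    This keyword heuristic is an assumption for illustration only.
    """
    text = reply.lower()
    # Check 'fake' cues first: a reply like "this looks fake, not real"
    # should not be captured by the later 'real' check.
    if "fake" in text or "synthetic" in text or "manipulated" in text:
        return "fake"
    if "real" in text or "authentic" in text:
        return "real"
    return "unknown"
```

In practice the prompt would be sent alongside the image through each model's API, and the parsed labels compared against ground truth to compute the accuracy figures the study reports.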
Datasets
A benchmark dataset of authentic and manipulated images from diverse sources, including FF++, DFDC, CelebDF, Lexica.art, Krea.ai, and Civitai.com, along with real images from Unsplash and Getty Images.
Model(s)
ChatGPT (GPT-4o), Claude (Sonnet 4), Gemini (2.5 Flash), and Grok 3.
Author countries
Australia