On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations

Authors: Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian

Published: 2025-07-30 05:41:29+00:00

AI Summary

This paper reveals a critical vulnerability in Vision-Language Models (VLMs) used for DeepFake detection and image captioning. By applying subtle, structured perturbations in the frequency domain of images, the authors demonstrate that VLMs' judgments are easily manipulated, highlighting their reliance on low-level image features rather than semantic content.

Abstract

Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs when they are presented with subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations, operating in the frequency domain, to systematically adjust VLM outputs when exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs, which include different-parameter Qwen2/2.5 and BLIP models. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.

Key findings
Visually imperceptible frequency-based perturbations reliably manipulate VLM predictions for both image authenticity and captioning tasks. VLMs exhibit fragility under these attacks, suggesting a dependence on low-level image cues rather than semantic understanding. These vulnerabilities generalize across various state-of-the-art VLMs, regardless of model size or architecture.
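The snippet below is a minimal sketch of how "visually imperceptible" could be quantified when reproducing such experiments; the paper's exact criterion is not stated here, so PSNR and SSIM from scikit-image are used as illustrative similarity measures with assumed thresholds.

```python
# Hedged sketch: quantify whether a perturbed image stays visually close to the
# clean one. Thresholds (40 dB PSNR, 0.98 SSIM) are illustrative assumptions,
# not values taken from the paper.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def is_imperceptible(clean, perturbed, psnr_thresh=40.0, ssim_thresh=0.98):
    """Return True if the perturbation stays below the illustrative visibility thresholds.

    clean, perturbed: float arrays in [0, 1], shape (H, W, C).
    """
    psnr = peak_signal_noise_ratio(clean, perturbed, data_range=1.0)
    ssim = structural_similarity(clean, perturbed, channel_axis=-1, data_range=1.0)
    return psnr >= psnr_thresh and ssim >= ssim_thresh
```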
Approach
The authors design targeted image transformations in the frequency domain to systematically adjust VLM outputs. These visually imperceptible perturbations are applied iteratively under black-box constraints, guided by the VLM's responses, to manipulate its predictions for both image authenticity and caption generation; a hedged sketch of such a loop follows.
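The code below is a minimal sketch of an iterative, query-based frequency-domain perturbation, not the authors' exact method. The `vlm_score` callable, the annular frequency band, the step size `eps`, and the number of iterations are all illustrative assumptions.

```python
# Hedged sketch of a black-box frequency-domain attack loop. `vlm_score(img)`
# is a hypothetical callable returning how strongly the VLM favors the target
# output (e.g. the likelihood of answering "real"); band limits and eps are
# illustrative, not paper values.
import numpy as np

def perturb_frequency(img, vlm_score, steps=20, eps=0.02, band=(0.05, 0.25)):
    """Iteratively nudge a band of spatial frequencies to raise vlm_score.

    img: float32 array in [0, 1], shape (H, W, C).
    """
    h, w, c = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy**2 + fx**2)
    mask = ((radius >= band[0]) & (radius <= band[1]))[..., None]  # annular frequency band

    x = img.copy()
    best = vlm_score(x)
    for _ in range(steps):
        # Random structured direction restricted to the chosen frequency band.
        phase = np.exp(2j * np.pi * np.random.rand(h, w, c))
        delta = np.real(np.fft.ifft2(phase * mask, axes=(0, 1)))
        delta *= eps / (np.abs(delta).max() + 1e-8)  # keep the step visually subtle

        for sign in (+1.0, -1.0):  # simple two-sided query-based search
            cand = np.clip(x + sign * delta, 0.0, 1.0)
            score = vlm_score(cand)
            if score > best:       # keep only steps that improve the VLM response
                x, best = cand, score
                break
    return x
```

The design choice here is deliberately simple: a random band-limited direction plus accept/reject queries keeps the loop strictly black-box, mirroring the realistic constraints the paper describes, at the cost of more VLM queries than a gradient-based attack would need.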
Datasets
Stable Diffusion v3.5-Large generated images (fantasy, outdoor, ImageNet scenes), Stable Diffusion v1.4 generated ImageNet-1K, CIFAKE, Google Conceptual Captions (GCC), COCO-2017, Flickr30k, real ImageNet, and CIFAR-10.
Model(s)
Qwen2-VL-7B-Instruct, Qwen2-VL-2B, Qwen2.5-VL-3B, BLIP2-VL-2.7B, BLIP2-VL-6.7B
Author countries
Australia (all authors)