Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu

Published: 2025-03-14 15:42:42+00:00

Comment: Accepted at the IJCAI 2025 Workshop on Deepfake Detection, Localization, and Interpretability as Best Student Paper

AI Summary

This paper investigates Typographic Visual Prompt Injection (TVPI) threats in cross-modality generation models, specifically Large Vision Language Models (LVLMs) and Image-to-Image Generation Models (I2I GMs). The authors introduce a comprehensive TVPI Dataset and thoroughly evaluate the security risks on various open-source and closed-source models under diverse visual prompt configurations. The study deepens the understanding of TVPI threats, revealing significant vulnerabilities and the limitations of simple defense mechanisms.

Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.


Key findings
Typographic Visual Prompt Injection significantly influences the outputs of both open-source and closed-source LVLMs and I2I GMs, causing disruptive generations that are semantically aligned with the injected text. Larger text sizes, higher opacity, and certain visual prompt positions generally lead to stronger attack effects. A simple defense, instructing models to ignore text in images, only partially reduced attack success rates and had minimal effect on image generation quality.
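For the vision-language case, an attack success rate can be read as the fraction of model responses that reflect the injected target. The sketch below assumes a simple case-insensitive substring match as the success criterion. That criterion, the function name `attack_success_rate`, and the toy responses are illustrative assumptions, not the paper's evaluation protocol or results.

```python
def attack_success_rate(responses, target):
    """Fraction of responses containing the target word.

    The substring-match criterion is an assumption made for illustration.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    target = target.lower()
    hits = sum(target in response.lower() for response in responses)
    return hits / len(responses)


# Toy responses invented for this example; they are not measured data.
toy_without_defense = ["sorry", "A dog on grass", "Sorry, I cannot", "sorry"]
toy_with_defense = ["A dog on grass", "sorry", "A dog", "Two dogs"]

print(attack_success_rate(toy_without_defense, "sorry"))  # 0.75
print(attack_success_rate(toy_with_defense, "sorry"))     # 0.25
```

Comparing the two rates on the same inputs is one way to express "partial effectiveness": the rate drops under the defense but does not reach zero.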
Approach
The authors propose the Typographic Visual Prompts Injection Dataset, comprising VLP and I2I subtype datasets, to systematically analyze TVPI threats. They evaluate the performance impact of TVPI by injecting typographic visual prompts into input images of various open-source and closed-source LVLMs and I2I GMs, using different text factors (size, opacity, position) and target semantics (protective, harmful, bias, neutral).
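The factor sweep described above can be pictured as a set of visual prompt configurations. The following is a minimal sketch, not the authors' code: the specific sizes, opacities, and position names are placeholder values assumed for illustration, and `VisualPromptConfig` and `build_grid` are hypothetical names. Only the four target-semantics categories come from the description above. A full Cartesian grid is also an assumption; a study could equally vary one factor at a time.

```python
from dataclasses import dataclass
from itertools import product

# Target-semantics categories named in the approach description.
TARGET_SEMANTICS = ("protective", "harmful", "bias", "neutral")

# Placeholder factor values, assumed for illustration only.
FONT_SIZES_PT = (8, 12, 16, 20)
OPACITIES = (0.25, 0.5, 0.75, 1.0)
POSITIONS = ("top_left", "top_right", "bottom_left", "bottom_right")


@dataclass(frozen=True)
class VisualPromptConfig:
    """One typographic visual prompt setting (hypothetical container)."""
    semantics: str
    font_size_pt: int
    opacity: float
    position: str


def build_grid():
    """Enumerate every combination of the factors above."""
    return [
        VisualPromptConfig(semantics, size, opacity, position)
        for semantics, size, opacity, position in product(
            TARGET_SEMANTICS, FONT_SIZES_PT, OPACITIES, POSITIONS
        )
    ]


grid = build_grid()
print(len(grid))  # 4 * 4 * 4 * 4 = 256 configurations
```

Each configuration would then drive one rendering of the prompt text onto an input image before the image is passed to the model under test.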
Datasets
Typographic Visual Prompts Injection (TVPI) Dataset (proposed), ImageNet, Visual7W, TallyQA, MSCOCO, CelebA-HQ
Model(s)
UNKNOWN
Author countries
Hong Kong, China, United Kingdom