DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models

Authors: Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, Duen Horng Chau

Published: 2022-10-26 17:54:20+00:00

AI Summary

This paper introduces DiffusionDB, a large-scale dataset of 14 million images generated by Stable Diffusion, paired with the 1.8 million unique prompts and the hyperparameters that real users specified. It supports research on prompt engineering, on how generative models respond to prompts, and on deepfake detection.

Abstract

With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts or what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale text-to-image prompt dataset totaling 6.5TB, containing 14 million images generated by Stable Diffusion, 1.8 million unique prompts, and hyperparameters specified by real users. We analyze the syntactic and semantic characteristics of prompts. We pinpoint specific hyperparameter values and prompt styles that can lead to model errors and present evidence of potentially harmful model usage, such as the generation of misinformation. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models. DiffusionDB is publicly available at: https://poloclub.github.io/diffusiondb.


Key findings

The analysis characterizes the syntactic and semantic patterns of user prompts and pinpoints specific hyperparameter values and prompt styles that can lead to model errors. It also presents evidence of potentially harmful model usage, such as the generation of misinformation and nonconsensual pornography.

Approach

The dataset was constructed by scraping user-generated images, together with their prompts and hyperparameters, from the public Stable Diffusion Discord server. NSFW content was flagged using pre-trained classifiers, and the data is organized for efficient access.
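
As a rough illustration of how records of this kind might be handled, the sketch below models one scraped entry as an image name, a prompt, a few hyperparameters, and a classifier-assigned NSFW score. It then counts unique prompts and filters by score. The `Record` fields, the helper names, and the 0.5 threshold are assumptions made for this example; they are not the dataset's actual schema or the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical shape of one image entry (not the real schema)."""
    image_name: str
    prompt: str
    seed: int
    cfg_scale: float
    nsfw_score: float  # assumed classifier output in [0, 1]


def unique_prompts(records):
    """Return the set of distinct prompts, ignoring surrounding whitespace."""
    return {r.prompt.strip() for r in records}


def filter_safe(records, threshold=0.5):
    """Keep records whose NSFW score is below an assumed threshold."""
    return [r for r in records if r.nsfw_score < threshold]


# Toy data: two images share a prompt, one image is flagged.
records = [
    Record("a.png", "a cat in space", seed=1, cfg_scale=7.0, nsfw_score=0.01),
    Record("b.png", "a cat in space", seed=2, cfg_scale=7.0, nsfw_score=0.02),
    Record("c.png", "portrait, oil painting", seed=3, cfg_scale=12.0, nsfw_score=0.90),
]

print(len(unique_prompts(records)))                   # 2
print([r.image_name for r in filter_safe(records)])   # ['a.png', 'b.png']
```

Several images can share one prompt, which is consistent with the dataset holding far more images (14 million) than unique prompts (1.8 million).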

Datasets

Stable Diffusion Discord server data (14 million images, 1.8 million unique prompts)

Model(s)

Stable Diffusion, CLIP (for embedding extraction), multilingual toxicity prediction model, EfficientNet classifier (for NSFW detection)

Author countries

USA