Circumventing Concept Erasure Methods For Text-to-Image Generative Models

Authors: Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, Chinmay Hegde

Published: 2023-08-03 02:34:01+00:00

AI Summary

This paper demonstrates that five recently proposed concept erasure methods for text-to-image generative models are ineffective at fully removing the targeted concepts. The authors show this by devising a concept inversion algorithm that learns special word embeddings capable of retrieving erased concepts from the sanitized models, without altering the models' weights.

Abstract

Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to erase sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve erased concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.


Key findings
All five examined concept erasure methods were successfully circumvented using the proposed concept inversion technique. The results highlight the brittleness of post hoc erasure methods and suggest a need for fundamentally new approaches to building and evaluating safe generative models. The learned embeddings were also transferable to the original, unmodified model.
Approach
The researchers use concept inversion (CI), a technique that learns special word embeddings that recover erased concepts. When these embeddings are supplied as input to the sanitized models, whose weights are left untouched, they bypass the intended concept erasure and yield images containing the supposedly removed concepts.
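The core mechanics can be illustrated with a toy sketch of this optimization. The snippet below is not the authors' code: the frozen matrix `W` is an illustrative stand-in for the (unchanged) sanitized diffusion model, and `target` stands in for denoising targets derived from example images of the erased concept. As in concept inversion, only the new word embedding `v` is updated by gradient descent; the model weights are never modified.

```python
import numpy as np

# Toy stand-in for concept inversion: optimize a single word embedding `v`
# against a frozen model so that the model's output matches a concept target.
rng = np.random.default_rng(0)

d_embed, d_out = 8, 16
W = rng.normal(size=(d_out, d_embed))   # frozen "model" weights (never updated)
v_true = rng.normal(size=d_embed)       # hidden embedding that elicits the concept
target = W @ v_true                     # stand-in for targets from concept images

v = np.zeros(d_embed)                   # learnable embedding for the new token
lr = 0.01
for _ in range(2000):
    pred = W @ v                        # model output conditioned on the embedding
    grad = 2 * W.T @ (pred - target)    # gradient of MSE loss w.r.t. `v` only
    v -= lr * grad                      # update the embedding; W stays frozen

loss = float(np.sum((W @ v - target) ** 2))
print(f"final loss: {loss:.2e}")
```

In the real setting the loss is the diffusion model's denoising objective and the optimization runs through the text encoder, but the structure is the same: a plain input-space optimization against frozen weights, which is why erasure methods that only fine-tune weights remain vulnerable.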
Datasets
Stable Diffusion 1.4, Imagenette, I2P, LAION-400M, LAION-5B, MNIST, Google Images
Model(s)
Stable Diffusion (various versions, including fine-tuned versions from different concept erasure methods), ResNet-50, CLIP, NudeNet, GIPHY celebrity detector
Author countries
USA, Israel