Text-image guided Diffusion Model for generating Deepfake celebrity interactions

Authors: Yunzhuo Chen, Nur Al Hasan Haldar, Naveed Akhtar, Ajmal Mian

Published: 2023-09-26 08:24:37+00:00

AI Summary

This paper proposes a method for generating high-quality deepfake images using a modified Stable Diffusion model. The method conditions on both text and image prompts, which improves control and realism, particularly in scenes with multiple people, where existing diffusion models perform poorly.

Abstract

Deepfake images are fast becoming a serious concern due to their realism. Diffusion models have recently demonstrated highly realistic visual content generation, which makes them an excellent potential tool for Deepfake generation. To curb their exploitation for Deepfakes, it is imperative to first explore the extent to which diffusion models can be used to generate realistic content that is controllable with convenient prompts. This paper devises and explores a novel method in that regard. Our technique alters the popular Stable Diffusion model to generate a controllable, high-quality Deepfake image from text and image prompts. In addition, the original Stable Diffusion model performs poorly when generating images that contain multiple persons. The modified diffusion model addresses this problem by using the latent of an input anchor image at the beginning of inference, rather than a Gaussian random latent. Hence, we focus on generating forged content for celebrity interactions, which may be used to spread rumors. We also apply Dreambooth to enhance the realism of our fake images. Dreambooth trains the pairing of center words and specific features to produce more refined and personalized output images. Our results show that with the devised scheme, it is possible to create fake visual content with alarming realism, such that the content can serve as believable evidence of meetings between powerful political figures.
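The abstract's idea of starting inference from an anchor latent instead of a pure Gaussian latent can be illustrated with a toy numerical sketch. This is not the authors' implementation: the summary does not say whether or how noise is mixed into the anchor latent, so the sketch assumes the standard DDPM forward-noising formula with a linear beta schedule, as used in common image-to-image pipelines. The function names, the schedule values, and the flat list standing in for a latent tensor are all hypothetical.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def init_latent_from_anchor(anchor_latent, alpha_bar_t, rng):
    """Noise a flattened anchor latent z0 to one timestep:
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, 1).
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in anchor_latent]


rng = random.Random(0)
# Random numbers standing in for a latent; no real image or encoder is involved.
anchor = [rng.uniform(-1.0, 1.0) for _ in range(4096)]
alpha_bar = linear_alpha_bar(1000)
# Start from a mid-schedule timestep so part of the anchor's structure is kept.
start_latent = init_latent_from_anchor(anchor, alpha_bar[499], rng)
```

With `alpha_bar_t` close to 1 the starting latent stays near the anchor, and with `alpha_bar_t` close to 0 it approaches the purely Gaussian latent of an unmodified sampler, so under these assumptions the chosen timestep controls how strongly the anchor constrains the result.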


Key findings

The modified model significantly outperforms the original Stable Diffusion model in generating realistic deepfake images, especially those depicting interactions between multiple people. Subjective evaluations show a marked improvement in the convincingness of generated images.

Approach

The authors modify the Stable Diffusion model by adding an image encoder and a two-stream U-Net to process both text and image prompts. They also utilize Dreambooth to enhance realism and control over generated images.

Datasets

Images of celebrities obtained from Google searches; a small number of images (3-5) per celebrity were used for Dreambooth training.

Model(s)

Modified Stable Diffusion model with a two-stream U-Net and Dreambooth fine-tuning.

Author countries

Australia