Image Generation Models: A Technical History

Authors: Rouzbeh Shirvani

Published: 2026-03-08 04:11:01+00:00

AI Summary

This paper provides a comprehensive technical survey of breakthrough image and video generation models, encompassing variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. It details their underlying objectives, architectural components, training steps, and limitations, while also discussing recent advancements in high-quality video generation. Furthermore, the survey addresses the growing importance of responsible deployment, including deepfake risks, detection, artifacts, and watermarking.

Abstract

Image generation has advanced rapidly over the past decade, yet the literature remains fragmented across model families and application domains. This paper offers a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, covering its underlying objective, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also review recent developments in video generation and the research that made it possible to move from still frames to high-quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.


Key findings
Image generation has changed substantially over the last decade, moving from early, lower-quality VAEs and GANs to diffusion and transformer-based models capable of producing photorealistic images and coherent videos with fine-grained control. This progress rests on continued architectural innovation and refined training objectives. However, the growing capabilities of these models raise societal challenges such as deepfake proliferation, copyright issues, and bias, which call for better detection methods and responsible deployment practices.
Approach
The paper surveys image and video generation models through detailed technical walkthroughs that cover their architectural building blocks, training algorithms, optimization techniques, and common limitations. It traces the chronological evolution of these models and concludes by examining recent developments in video generation and societal implications such as deepfake risks and detection.
Datasets
ImageNet, CIFAR-10, FFHQ, MS-COCO, UCF-101, FaceForensics++, DeepFake Detection Challenge (DFDC) Dataset, UADFV, DeepfakeTIMIT, CycleGAN, StarGAN, Places2