Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile

Authors: Seokjun Lee, Seung-Won Jung, Hyunseok Seo

Published: 2024-03-08 06:39:24+00:00

Comment: Accepted to AAAI 2024

AI Summary

This paper proposes STIG (Spectrum Translation for Refinement of Image Generation), a framework that leverages contrastive learning and spectral translation to mitigate frequency domain discrepancies in images generated by both GAN and diffusion models. STIG effectively refines the magnitude spectrum of generated images, leading to improved image quality and making them significantly harder for frequency-based deepfake detectors to identify. The framework achieves state-of-the-art performance in reducing spectral anomalies and enhancing realism.

Abstract

Currently, image generation and synthesis have progressed remarkably with generative models. Despite photo-realistic results, intrinsic discrepancies are still observed in the frequency domain. This spectral discrepancy appears not only in generative adversarial networks but also in diffusion models. In this study, we propose a framework to effectively mitigate the frequency-domain disparity of generated images and improve the generative performance of both GAN and diffusion models. This is realized by spectrum translation for the refinement of image generation (STIG) based on contrastive learning. We build on theoretical analyses of frequency components in various generative networks. The key idea is to refine the spectrum of the generated image via the concepts of image-to-image translation and contrastive learning, from a digital signal processing perspective. We evaluate our framework across eight fake image datasets and various cutting-edge models to demonstrate the effectiveness of STIG. Our framework outperforms other state-of-the-art methods, showing significant decreases in FID and the log frequency distance of the spectrum. We further emphasize that STIG improves image quality by reducing spectral anomalies. Additionally, validation results show that a frequency-based deepfake detector is more easily confused when fake spectra are manipulated by STIG.


Key findings
STIG significantly reduces spectral discrepancies, as evidenced by considerable decreases in FID and log frequency distance (LFD) across various GANs and diffusion models, outperforming existing methods. The framework also demonstrably improves the visual quality of generated images in the spatial domain. Moreover, STIG-manipulated images severely confuse frequency-based deepfake detectors (both CNN- and ViT-based), causing their detection accuracy to drop from near-perfect to significantly lower levels.
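The LFD metric compares how far a generated image's frequency content strays from a real reference. A minimal sketch of such a metric is below; the exact formulation used in the paper is not quoted here, so the log-of-mean-squared-spectral-difference form is an assumption:

```python
import numpy as np

def log_frequency_distance(real, fake):
    """Sketch of a log frequency distance (LFD) style metric:
    log of the mean squared difference between the 2-D magnitude
    spectra of two images. Formulation assumed, not quoted from
    the paper. Identical inputs give log(1) = 0."""
    mag_real = np.abs(np.fft.fft2(real))
    mag_fake = np.abs(np.fft.fft2(fake))
    return np.log(np.mean((mag_real - mag_fake) ** 2) + 1.0)
```

Lower values indicate the fake spectrum more closely matches the real one, which is the direction STIG pushes generated images.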
Approach
The STIG framework refines generated images by translating their magnitude spectrum to align with real image spectra using adversarial learning and patch-wise contrastive learning. It employs auxiliary regularizations, including a spectral discriminator with chessboard integration to match power spectral density and a low-frequency loss to preserve fundamental energy levels. This process, rooted in digital signal processing, directly manipulates frequency components to reduce artifacts and fill in insufficient high-frequency content.
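The DSP backbone of this approach can be sketched as: decompose the image into magnitude and phase via the 2-D FFT, refine only the magnitude, then recombine with the untouched phase. Here `refine_magnitude` is a placeholder for the learned translator (in the paper, a nested U-Net trained with adversarial and patch-wise contrastive losses); everything else is standard FFT plumbing:

```python
import numpy as np

def spectrum_refine(image, refine_magnitude):
    """Sketch of STIG's core DSP step: translate the magnitude
    spectrum of an image while preserving its phase."""
    # Forward 2-D FFT, shifted so low frequencies sit at the center
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Placeholder for the learned magnitude translator
    refined = refine_magnitude(magnitude)

    # Recombine refined magnitude with the original phase and invert
    recombined = refined * np.exp(1j * phase)
    return np.real(np.fft.ifft2(np.fft.ifftshift(recombined)))
```

With an identity `refine_magnitude`, the round trip reconstructs the input image, which makes the decomposition easy to sanity-check before plugging in a trained network.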
Datasets
Images generated from CycleGAN, StarGAN, StarGAN2, StyleGAN (trained on CelebA, AFHQ, FFHQ), DDPM, and DDIM (trained on FFHQ, LSUN Church).
Model(s)
For the STIG framework: Generator (Nested U-Net), Discriminator (PatchGAN), and Spectral Discriminator (simple fully connected layer). For evaluating against deepfake detectors (image-based, not audio-based): CNN-based frequency domain detector and ViT-B16.
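A simple fully connected spectral discriminator typically consumes a 1-D power spectral density profile rather than the raw 2-D spectrum. A common way to obtain such a profile, sketched below under the assumption of azimuthal (radial) integration of the power spectrum, is:

```python
import numpy as np

def radial_psd_profile(image):
    """Azimuthally integrated 1-D power spectral density profile
    of an image, as commonly fed to spectral discriminators
    (assumed setup, not the paper's exact pipeline)."""
    # Centered power spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    psd = np.abs(spectrum) ** 2

    # Integer radius of each frequency bin from the DC center
    h, w = psd.shape
    y, x = np.indices(psd.shape)
    r = np.hypot(y - h // 2, x - w // 2).astype(int)

    # Mean power at each radius (guard against empty radii)
    counts = np.bincount(r.ravel())
    sums = np.bincount(r.ravel(), weights=psd.ravel())
    return sums / np.maximum(counts, 1)
```

For a constant image, all energy sits at the DC bin, so the profile is positive at radius 0 and zero elsewhere; real versus generated images differ mainly in the high-radius tail of this curve.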
Author countries
South Korea