Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

Authors: Yuan-Chih Chen, Chun-Shien Lu

Published: 2026-02-26 08:47:48+00:00

AI Summary

This paper introduces a unified hidden-code recovery framework for natural image deepfake recovery and factual retrieval, moving beyond traditional detection and localization. The method encodes multi-scale semantic and perceptual information into a compact hidden-code via vector quantization and refines contextual reasoning with conditional Transformer modules. The authors also construct ImageNet-S, a benchmark for systematic evaluation, on which the method achieves promising retrieval and reconstruction performance across diverse watermarking pipelines.

Abstract

Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.


Key findings
The proposed method significantly outperforms existing inpainting-based and information-hiding baselines in factual retrieval accuracy (e.g., Top-1 label accuracy of 0.9231 on ImageNet-S). Because the watermark is content-dependent, the method is robust to common image degradations and a range of forgery attacks. Qualitative results show superior visual coherence and detail restoration compared to competing methods.
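For concreteness, the Top-1 metric can be read as follows, assuming retrieval scores recovered-image embeddings against a gallery of label embeddings by cosine similarity. The function name, tensor shapes, and the choice of cosine similarity are illustrative assumptions, not the paper's documented protocol.

```python
import torch
import torch.nn.functional as F

def top1_label_accuracy(image_emb, label_emb, true_labels):
    """Top-1 retrieval accuracy under cosine similarity (illustrative sketch).

    image_emb:   (N, D) embeddings of recovered images
    label_emb:   (L, D) one embedding per candidate label
    true_labels: (N,)   ground-truth label indices
    """
    # Normalize so that the dot product equals cosine similarity.
    sims = F.normalize(image_emb, dim=1) @ F.normalize(label_emb, dim=1).T
    pred = sims.argmax(dim=1)  # index of the most similar label per image
    return (pred == true_labels).float().mean().item()
```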
Approach
The method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization. This hidden-code is embedded into images using proactive watermarking techniques (both post-hoc and in-generation). For recovery, a conditional Transformer leverages the extracted hidden-code and a patch-level localization mask to reconstruct tampered image regions, followed by factual retrieval based on semantic similarity.
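A minimal sketch of the multi-scale quantization step may help make this concrete. The code below is a VAR-style residual quantizer written under assumed settings (the latent dimension, codebook size, and scale schedule are illustrative, not the paper's); each scale quantizes only what coarser scales failed to capture, which is what keeps the hidden-code compact.

```python
import torch
import torch.nn.functional as F

class MultiScaleVQ(torch.nn.Module):
    """VAR-style multi-scale residual vector quantizer (illustrative sketch)."""
    def __init__(self, latent_dim=32, codebook_size=4096, scales=(1, 2, 4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.codebook = torch.nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # z: (B, C, H, W) latent feature map from an encoder
        codes, residual = [], z
        for s in self.scales:
            # Downsample the current residual to an s x s grid.
            r = F.interpolate(residual, size=(s, s), mode="area")
            # Nearest-codebook-entry lookup per spatial position.
            flat = r.permute(0, 2, 3, 1).reshape(-1, r.shape[1])
            idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
            codes.append(idx.view(r.shape[0], s, s))
            # Upsample the quantized approximation back to full size and
            # subtract it, so finer scales only encode the remaining error.
            q = self.codebook(idx).view(r.shape[0], s, s, -1).permute(0, 3, 1, 2)
            residual = residual - F.interpolate(
                q, size=z.shape[-2:], mode="bilinear", align_corners=False
            )
        return codes  # list of index maps: the compact multi-scale hidden-code
```

For example, `MultiScaleVQ().quantize(torch.randn(1, 32, 16, 16))` returns five index maps of sizes 1x1 through 16x16; flattened, these indices would form the hidden-code that the watermarking stage embeds.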
Datasets
ImageNet-S (constructed by the authors, with ImageNet as the base dataset and LISA used to generate segmentation masks)
Model(s)
VQ-VAE (specifically VAR [23] for multi-scale quantization), Conditional Transformer (with VAR architecture), EditGuard [30] (for post-hoc watermarking), Gaussian Shading-based VideoShield [9] (for in-generation watermarking), Stable Diffusion [19] (for generating deepfake images).
Author countries
Taiwan, ROC