DinoLizer: Learning from the Best for Generative Inpainting Localization

Authors: Minh Thong Doi, Jan Butora, Vincent Itier, Jérémie Boulanger, Patrick Bas

Published: 2025-11-25 08:37:24+00:00

AI Summary

DinoLizer introduces a DINOv2-based model for localizing manipulated regions in generative image inpainting. The method builds on a DINOv2 backbone pretrained for synthetic image detection and adds a linear classification head on the Vision Transformer's patch embeddings. A sliding-window strategy aggregates patch-level predictions into high-resolution manipulation masks, and DinoLizer shows superior performance and robustness to post-processing compared with state-of-the-art detectors.

Abstract

We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.


Key findings
DinoLizer significantly outperforms existing local manipulation detectors, achieving on average a 12% higher Intersection-over-Union (IoU) than the next best model. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. The results also highlight the strong representational power of Vision Transformers (DINOv2) for forgery localization and the benefit of treating auto-encoded regions as pristine during training.
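
For reference, the IoU figures above compare predicted and ground-truth binary manipulation masks. A minimal NumPy sketch of this metric is given below; the helper name mask_iou is illustrative and not taken from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary manipulation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: treat as perfect agreement
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```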
Approach
DinoLizer uses a frozen DINOv2-B backbone (a Vision Transformer pretrained for synthetic image detection) and adds a lightweight linear classification head on its patch embeddings to predict manipulations. The head is trained with a Dice loss to focus on semantically altered regions, treating auto-encoded content as pristine. A sliding-window strategy aggregates patch-level predictions over larger images, and post-processing refines the resulting binary manipulation masks, as sketched below.
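
The following PyTorch sketch illustrates this pipeline under stated assumptions: the backbone is the ViT-B/14 model from the facebookresearch/dinov2 torch.hub entry point, whose forward_features output exposes per-patch tokens under "x_norm_patchtokens"; the class and helper names (DinoLizerSketch, dice_loss, sliding_window_heatmap), the 518-pixel window, and the stride are illustrative choices rather than the authors' implementation.

```python
# Minimal sketch: frozen DINOv2 patch tokens + linear head, Dice loss,
# and sliding-window aggregation. Hub model name and token key follow the
# facebookresearch/dinov2 repo; everything else is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DinoLizerSketch(nn.Module):
    def __init__(self, embed_dim=768, patch=14):
        super().__init__()
        # Frozen DINOv2 ViT-B/14 backbone (weights download on first call).
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.patch = patch
        # Lightweight head: one manipulation logit per 14x14 patch token.
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, x):                       # x: (B, 3, H, W), H and W multiples of 14
        feats = self.backbone.forward_features(x)["x_norm_patchtokens"]  # (B, N, D)
        logits = self.head(feats).squeeze(-1)   # (B, N)
        h, w = x.shape[-2] // self.patch, x.shape[-1] // self.patch
        return logits.view(-1, h, w)            # patch-level manipulation logits

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss between patch-level logits and a binary target mask."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2))
    union = prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

@torch.no_grad()
def sliding_window_heatmap(model, image, win=518, stride=259):
    """Average overlapping window predictions into a full-resolution heatmap.
    Assumes image is (3, H, W) with H, W >= win; 518 = 37 * 14 fits ViT-B/14."""
    _, H, W = image.shape
    heat = torch.zeros(H, W)
    count = torch.zeros(H, W)
    ys = sorted({min(y, H - win) for y in range(0, H, stride)})
    xs = sorted({min(x, W - win) for x in range(0, W, stride)})
    for y in ys:
        for x in xs:
            crop = image[:, y:y + win, x:x + win].unsqueeze(0)
            prob = torch.sigmoid(model(crop))[0]                # (win/14, win/14)
            up = F.interpolate(prob[None, None], size=(win, win), mode="nearest")[0, 0]
            heat[y:y + win, x:x + win] += up
            count[y:y + win, x:x + win] += 1
    return heat / count.clamp(min=1)
```

Thresholding the averaged heatmap (e.g. at 0.5) then yields the binary manipulation mask that is further refined by post-processing.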
Datasets
B-Free, MS-COCO, Beyond the Brush, CocoGlide, TGIF, SAGI-SP, SAGI-FR
Model(s)
DINOv2 (ViT-B/14 architecture), DINOv3 (ViT-B/16 architecture) for ablation studies
Author countries
France