DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization

Authors: Siran Peng, Haoyuan Zhang, Li Gao, Tianshuo Zhang, Xiangyu Zhu, Bao Li, Weisong Zhao, Zhen Lei

Published: 2025-08-03 18:06:04+00:00

AI Summary

DiffusionFF is a novel diffusion-based framework for joint face forgery detection and fine-grained artifact localization. It utilizes a pretrained forgery detector as an artifact encoder and repurposes a denoising diffusion model as an artifact decoder, conditioned on multi-scale forgery-related features. By fusing the progressively synthesized artifact localization map with high-level semantic features, DiffusionFF significantly improves detection capability.

Abstract

The rapid evolution of deepfake technologies demands robust and reliable face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery clues is also important for enhancing model explainability and building user trust. To address this dual challenge, we introduce DiffusionFF, a diffusion-based framework that simultaneously performs face forgery detection and fine-grained artifact localization. Our key idea is to establish a novel encoder-decoder architecture: a pretrained forgery detector serves as a powerful artifact encoder, and a denoising diffusion model is repurposed as an artifact decoder. Conditioned on multi-scale forgery-related features extracted by the encoder, the decoder progressively synthesizes a detailed artifact localization map. We then fuse this fine-grained localization map with high-level semantic features from the forgery detector, leading to substantial improvements in detection capability. Extensive experiments show that DiffusionFF achieves state-of-the-art (SOTA) performance across multiple benchmarks, underscoring its superior effectiveness and explainability.


Key findings
DiffusionFF achieves state-of-the-art (SOTA) performance in cross-dataset face forgery detection across multiple benchmarks and demonstrates superior effectiveness in fine-grained artifact localization compared to existing methods. The framework exhibits strong generality, acting as an explainable, plug-and-play enhancement for various existing forgery detectors, and maintains robustness under diverse degradation conditions.
Approach
The framework establishes an encoder-decoder architecture. A pretrained forgery detector acts as an "artifact encoder" that extracts multi-scale forgery-related features. A denoising diffusion model is repurposed as an "artifact decoder" that is conditioned on these features and progressively synthesizes a detailed artifact localization map based on structural dissimilarity (DSSIM). The estimated map is then fused with high-level semantic features from the detector via a gating mechanism to enhance detection performance.
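As a rough illustration of the two ideas in this paragraph, the Python sketch below computes a DSSIM map between a real image and a manipulated copy, and blends two feature vectors through a sigmoid gate. It assumes the common definition DSSIM = (1 - SSIM) / 2 with a uniform local window. The window size, the stabilizing constants, the function names (box_filter, dssim_map, gated_fusion), and the exact form of the gate are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def box_filter(img, win):
    """Local mean over a win x win window (win odd), edge-padded to keep the shape."""
    pad = win // 2
    padded = np.pad(img, pad, mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    total = (sat[win:win + h, win:win + w] - sat[:h, win:win + w]
             - sat[win:win + h, :w] + sat[:h, :w])
    return total / (win * win)


def dssim_map(x, y, win=7, data_range=1.0):
    """Per-pixel DSSIM = (1 - SSIM) / 2 for two grayscale images of equal shape.

    Assumption: uniform window and the usual SSIM constants; the paper's
    ground-truth maps may be built with different settings.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = box_filter(x, win), box_filter(y, win)
    var_x = box_filter(x * x, win) - mu_x ** 2
    var_y = box_filter(y * y, win) - mu_y ** 2
    cov = box_filter(x * y, win) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return np.clip((1.0 - ssim) / 2.0, 0.0, 1.0)


def gated_fusion(semantic, map_feat, w_gate, b_gate):
    """Hypothetical gate: g = sigmoid([semantic; map_feat] @ W + b).

    Returns g * semantic + (1 - g) * map_feat. This is one generic gating
    form, not necessarily the one used in DiffusionFF.
    """
    z = np.concatenate([semantic, map_feat], axis=-1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * semantic + (1.0 - gate) * map_feat
```

In this sketch, regions where the two images agree get values near 0 and manipulated regions get larger values, which is the kind of fine-grained target a localization decoder could be trained to reproduce. The gate lets the classifier weight map-derived features against the detector's semantic features per dimension.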
Datasets
FaceForensics++ (FF++), Celeb-DeepFake-v2 (CDF2), DeepFake Detection Challenge (DFDC), DeepFake Detection Challenge Preview (DFDCP), FFIW-10K (FFIW)
Model(s)
ConvNeXt-B network (forgery detector/artifact encoder), U-Net architecture (denoising diffusion model/artifact decoder)
Author countries
China, Hong Kong, Macau