Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

Authors: Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran

Published: 2026-04-28 17:32:48+00:00

Comment: 4th place (out of 94 teams) in the NTIRE 2026 Robust Deepfake Detection Challenge

AI Summary

This paper introduces a foundation-driven forensic framework that counters spatial attention drift in deepfake detection models under real-world degradations. It integrates an extreme compound degradation engine with a multi-stream architecture comprising Global Texture, Localized Facial, and Hybrid Semantic Fusion pathways. This design extracts invariant geometric and semantic priors, stabilizes attention entropy, and yields robust zero-shot generalization, securing Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge.

Abstract

Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, forcing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. By analyzing spatial attribution via Score-CAM and feature stability via cosine similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. Aggregating their predictions via a calibrated, discretized voting mechanism, our ensemble suppresses background attention drift and acts as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
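
The degradation engine is described only at a high level here. As a rough illustration, below is a minimal Python (PIL) sketch of a blur-then-recompress chain; the function name `compound_degrade` and the severity ranges are hypothetical placeholders, not the authors' implementation, which is described as "extreme" and "compound" and so presumably chains more operations at harsher settings.

```python
import io
import random

from PIL import Image, ImageFilter

def compound_degrade(img: Image.Image,
                     blur_sigma=(0.5, 3.0),
                     jpeg_quality=(10, 50)) -> Image.Image:
    """Chain a random Gaussian blur with severe JPEG recompression.

    Illustrative only: ranges and operations are placeholder choices.
    """
    # Blur wipes out high-frequency generator artifacts first.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(*blur_sigma)))

    # Severe lossy compression then corrupts what the blur left behind.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG",
                            quality=random.randint(*jpeg_quality))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

Training on such samples removes the easy high-frequency cues, which is what pushes the backbone toward the geometric and semantic priors the abstract describes.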


Key findings
The proposed framework achieved Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge, with an AUC of 0.8775 on the public test set and 0.8523 on the private test set. The multi-stream architecture, combined with extreme compound degradation and a calibrated, discretized voting mechanism, significantly mitigates spatial attention drift and texture bias. Qualitative and quantitative analyses (Score-CAM, feature cosine similarity) confirm that the individual streams extract complementary features and maintain robust representations under severe degradations.
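
As a proxy for the feature-stability analysis, the cosine similarity between a backbone's features for clean and degraded inputs takes only a few lines of PyTorch. This sketch assumes a backbone returning pooled (B, D) features; `feature_stability` is a hypothetical helper, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_stability(backbone: torch.nn.Module,
                      clean: torch.Tensor,
                      degraded: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between features of a clean batch and its
    degraded counterpart; values near 1 mean the representation
    survives the degradation."""
    backbone.eval()
    f_clean = backbone(clean)      # assumed pooled features, shape (B, D)
    f_deg = backbone(degraded)
    return F.cosine_similarity(f_clean, f_deg, dim=-1)
```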
Approach
The authors propose a multi-stream architecture anchored by a DINOv2-Giant backbone and a frozen CLIP-Large backbone, trained with an extreme compound degradation pipeline. This ensemble consists of three specialized pathways: a Localized Facial Stream, a Global Texture Stream, and a Hybrid Semantic Fusion Stream, which are designed to capture complementary features (local geometry, global context, and semantic integrity). Predictions from these streams are aggregated via a calibrated, discretized voting mechanism to mitigate attention drift and texture bias.
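
The exact voting mechanism is not detailed in this summary. One plausible sketch assumes per-stream temperature-scaled calibration followed by snapping each probability to a uniform grid before averaging; all names, temperatures, and bin counts below are illustrative assumptions.

```python
import numpy as np

def discretized_vote(probs, temperatures, n_bins=5):
    """Calibrate each stream's fake-probability with temperature
    scaling on its logit, snap it to a uniform grid of n_bins
    levels, and average the resulting votes."""
    votes = []
    for p, t in zip(probs, temperatures):
        logit = np.log(p / (1.0 - p)) / t        # temperature scaling
        p_cal = 1.0 / (1.0 + np.exp(-logit))     # back to probability
        votes.append(round(p_cal * (n_bins - 1)) / (n_bins - 1))
    return float(np.mean(votes))

# Three streams: global texture, localized facial, hybrid semantic fusion.
# Temperatures would be fit on held-out validation data.
score = discretized_vote([0.92, 0.61, 0.75], temperatures=[1.3, 0.9, 1.1])
```

Under this reading, discretization caps how far a single miscalibrated stream can drag the ensemble score, consistent with the stated goal of suppressing attention drift and texture bias.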
Datasets
FaceForensics++, UADFV, Celeb-DF-v2, Celeb-DF-v3, DeepFakeDetection, DFDC, DFDCP, FaceShifter, DeeperForensics-1.0, DDL, DF40, FFIW, HIDF, RedFace. Evaluated on NTIRE 2026 Robust Deepfake Detection Challenge datasets (Train, Validation, Public Test, Private Test).
Model(s)
DINOv2-Giant (backbone), CLIP-Large (backbone), LoRA (for parameter-efficient tuning of DINOv2).
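
LoRA tuning of DINOv2 is mentioned without configuration details. Below is a minimal hand-rolled sketch of injecting low-rank adapters into a ViT's qkv projections; the rank, scaling, and torch.hub entry point are assumptions based on the public DINOv2 release, not the authors' setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B(A x), with B zero-initialized so
    training starts from the pretrained behavior."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Inject adapters into the qkv projections of the public DINOv2 release
# (assuming the hub model exposes .blocks[i].attn.qkv as nn.Linear).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
for blk in backbone.blocks:
    blk.attn.qkv = LoRALinear(blk.attn.qkv, r=16, alpha=32)
```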
Author countries
Vietnam