Robust Deepfake Detection, NTIRE 2026 Challenge: Report

Authors: Benedikt Hopf, Radu Timofte, Chenfan Qu, Junchi Li, Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo, Yongwei Tang, Zhiqiang Yang, Zhiqiang Wu, Jia Wen Seow, Hong Vin Koay, Haodong Ren, Feng Xu, Shuai Chen, Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran, Chih-Yu Jian, Yi-Fan Wang, Bang-Kang Chen, You-Chen Chao, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu, Aashish Negi, Hardik Sharma, Prateek Shaily, Jayant Kumar, Sachin Chaudhary, Akshay Dudhane, Praful Hambarde, Amit Shukla, Jielun Peng, Yabin Wang, Yaqi Li, Jincheng Liu, Xiaopeng Hong, Krish Wadhwani, Liam Fitzpatrick, Utkarsh Tiwari, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Cristian Lazo Quispe, Aishwarya A, Akshara S, Ashwathi N, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi

Published: 2026-04-27 08:19:53+00:00

AI Summary

This report summarizes the NTIRE 2026 Robust Deepfake Detection Challenge, which addresses the critical issue of deepfake detector performance under image degradations. Participants developed detection models that were evaluated on an unknown test set containing both common and uncommon degradations. The top-performing methods predominantly leveraged large foundation models, ensemble techniques, and degradation-aware training strategies to enhance robustness and generalization.

Abstract

Robustness is a long-overlooked problem in deepfake detection. Yet detection performance is of little real-world value if it collapses under even slight image degradation. Beyond the weaker degradations that can occur accidentally in an image processing pipeline, there is the additional risk of malicious deepfakes that deliberately introduce degradations to exploit a detector's weaknesses. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses this problem. Participants were tasked with building a detector that would later be evaluated on an unknown test set containing both common and uncommon degradations of various strengths. With 337 participants in total and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24 hours to complete the test run, with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test set to detect any such overfitting. This report presents the competition setting, the dataset preparation, and the details and performance of the submitted methods. The top methods rely on large foundation models, ensembles, and degradation-aware training to combine generality and robustness.


Key findings
The NTIRE 2026 challenge underscored the critical need for robustness in deepfake detection and attracted significant community participation. Key findings reveal that high-quality pretraining, especially with large foundation models, is essential for preventing overfitting and improving generalization. Furthermore, exposing models to diverse image degradations during training significantly enhances their robustness under real-world conditions.
Approach
The NTIRE 2026 challenge focused on developing deepfake detectors that are robust against image degradations, both accidental and malicious. Participants were given a training set with basic degradations and tasked with generalizing to an unknown test set featuring diverse and stronger degradations. Top solutions emphasized using large pre-trained foundation models, forming ensembles, and training with extensive degradation augmentation to improve detection robustness.
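The degradation-augmentation training strategy described above can be sketched as a random on-the-fly corruption step applied to each training image. The specific operations and strength ranges below are illustrative assumptions, not the challenge's official degradation set (which is based on PMM); a minimal numpy-only sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma):
    """Additive Gaussian noise, clipped back to the valid [0, 1] range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def box_blur(img, k):
    """Separable box blur with an odd kernel size k (moving average per axis)."""
    pad = k // 2
    out = img
    for axis in (0, 1):
        padded = np.pad(out, [(pad, pad) if a == axis else (0, 0)
                              for a in range(out.ndim)], mode="edge")
        out = np.stack([padded.take(range(i, i + out.shape[axis]), axis=axis)
                        for i in range(k)]).mean(axis=0)
    return out

def quantize(img, levels):
    """Crude stand-in for compression artifacts: coarse intensity quantization."""
    return np.round(img * (levels - 1)) / (levels - 1)

def random_degrade(img):
    """Pick one degradation at a random strength, as an augmentation step."""
    op = rng.integers(3)
    if op == 0:
        return add_gaussian_noise(img, sigma=rng.uniform(0.01, 0.1))
    if op == 1:
        return box_blur(img, k=int(rng.choice([3, 5, 7])))
    return quantize(img, levels=int(rng.integers(8, 64)))
```

In a real pipeline `random_degrade` would be applied (often with some probability, and sometimes as a chain of several operations) inside the data loader, so the detector never sees the same clean/degraded pairing twice.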
Datasets
The challenge's base images were derived from CelebV-HQ [95], with fake images generated using FaceSwap [31], StyleFeatureEditor [4], and FSGAN [51]. Degradations were based on PMM [24]. Participants utilized additional public datasets such as DDL [48], FaceForensics++ [62], DFDC [13], FakeAVCeleb [29], Celeb-DF++ [41], DF40 [88], Celeb-DF-v3 [41], DeeperForensics-1.0 [27], HIDF [28], RedFace [65], Celeb-DF-v2 [40], DeepFakeDetection [11], DFDCP [12], FFIW [94], FaceShifter [36], UADFV [39], and OpenMMSec [15], along with Self-Blended Images (SBI) [66] for training.
Model(s)
The top-performing methods predominantly utilized large Vision Transformers and foundation models. Specific architectures include DINOv3 (ViT-L, ViT-H, ViT-B/16) [67, 52], MetaCLIP 2 (Huge) [7], CLIP (ViT-L/14 and ViT-B/16) [14, 60], SigLIP [92], EVA-giant [18], EVA02-Large [19], I-JEPA [3], FSFM ViT-B [76], ConvNeXt-Small/Large [46], DeiT [73], and EfficientNet-V2 [71]. Many solutions also incorporated ensembles, LoRA [26] for fine-tuning, Multi-Instance Learning (MIL), attention pooling, and Group Distributionally Robust Optimization (GroupDRO) [63].
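The ensembling mentioned above typically amounts to combining the per-image fake scores of several independently trained detectors. As a hedged sketch (the actual submissions may ensemble at the logit, probability, or rank level, and the weighting scheme here is an assumption), score-level fusion with optional validation-based weights looks like:

```python
import numpy as np

def sigmoid(z):
    """Map raw logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def ensemble_predict(logits, weights=None):
    """Fuse per-model fake logits into one probability per image.

    logits: array of shape (n_models, n_images), one row per detector.
    weights: optional per-model weights (e.g. each model's validation
    AUC); defaults to a uniform average. Weights are normalized to sum
    to 1 so the output stays a valid probability.
    """
    probs = sigmoid(np.asarray(logits, dtype=float))
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ probs  # shape: (n_images,)
```

Averaging probabilities (rather than hard votes) preserves each model's confidence, which matters when the final metric, such as AUC, is threshold-free.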
Author countries
Germany, China, Vietnam, Taiwan, USA, India, United Arab Emirates, Ireland, Saudi Arabia, Peru