Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics

Authors: Yuezun Li, Delong Zhu, Xinjie Cui, Siwei Lyu

Published: 2025-07-24 01:12:28+00:00

AI Summary

The paper introduces Celeb-DF++, a large-scale video deepfake benchmark dataset featuring diverse forgery types (face-swap, face-reenactment, talking-face) generated by 22 different methods. It also presents evaluation protocols to assess the generalizability of deepfake detection methods, revealing the limitations of current techniques.

Abstract

The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for generalizable forensics, i.e., detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce Celeb-DF++, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using a total of 22 recent DeepFake methods. These methods differ in terms of architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detection methods and the difficulty of our new dataset.


Key findings

The results demonstrate that generalizable deepfake detection remains a significant challenge. Celeb-DF++ proves more difficult than previous benchmarks, with a substantial drop in detection accuracy across various models. The evaluation protocols highlight the impact of video compression and dataset variation on detector performance.
Approach

Celeb-DF++ expands upon the Celeb-DF dataset by incorporating a wider range of state-of-the-art deepfake generation methods across three forgery scenarios. It proposes three evaluation protocols (GF-eval, GFQ-eval, GFD-eval) to measure the generalizability of deepfake detectors by testing across different methods, compression levels, and datasets.
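The core idea behind these protocols is to score a detector separately on each unseen generation method and then aggregate, so that strong performance on one forgery family cannot mask failure on another. A minimal sketch of this kind of per-method evaluation is below; the function name, record format, and threshold are illustrative assumptions, not the paper's actual protocol code.

```python
from collections import defaultdict

def cross_method_accuracy(records, threshold=0.5):
    """Compute per-method and average detection accuracy.

    records: iterable of (method_name, fake_score, label) tuples, where
    label is 1 for a forged video and 0 for a real one, and fake_score is
    the detector's predicted probability that the video is fake.

    Hypothetical helper illustrating a GF-eval-style protocol: accuracy is
    computed independently for each generation method, then averaged, so a
    detector must generalize across all methods to score well.
    """
    hits = defaultdict(int)    # correct predictions per method
    totals = defaultdict(int)  # total videos per method
    for method, score, label in records:
        pred = 1 if score >= threshold else 0
        hits[method] += int(pred == label)
        totals[method] += 1
    per_method = {m: hits[m] / totals[m] for m in totals}
    average = sum(per_method.values()) / len(per_method)
    return per_method, average
```

Averaging over methods (rather than pooling all videos) keeps each forgery type equally weighted, which matches the benchmark's emphasis on generalizability rather than raw aggregate accuracy.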
Datasets

Celeb-DF++, FaceForensics++ (FF++) (for training detectors in comparative experiments), VoxCeleb2 (for audio in talking-face generation)
Model(s)

24 recent deepfake detection models, including MesoNet, MesoInception, Xception, EfficientNet-B4, Capsule, F3Net, CNN-Aug, FFD, SPSL, SRM, RFM, MATT, CLIP, RECCE, SBI, CORE, SIA, UCF, IID, LSDA, CFM, ProDet, ForAda, and Effort.
Author countries

China, USA