Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning

Authors: Ajian Liu, Haocheng Yuan, Xiao Guo, Hui Ma, Wanyi Zhuang, Changtao Miao, Yan Hong, Chuanbiao Song, Jun Lan, Qi Chu, Tao Gong, Yanyan Liang, Weiqiang Wang, Jun Wan, Xiaoming Liu, Zhen Lei

Published: 2025-05-19 16:35:45+00:00

AI Summary

This paper introduces UniAttackDataPlus (UniAttackData+), the most extensive and sophisticated dataset to date for Unified Face Attack Detection (UAD), encompassing 54 types of physical and digital attacks across 697,347 videos. It also proposes HiPTune, a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework. HiPTune adaptively explores multiple classification criteria from different semantic spaces using a Visual Prompt Tree, adaptive prompt pruning, and dynamic prompt integration to enhance detection robustness.

Abstract

Presentation Attack Detection (PAD) and Face Forgery Detection (FFD) are proposed to protect face data from physical media-based presentation attacks and digital editing-based DeepFakes, respectively. However, isolated training of these two models significantly increases vulnerability to unknown attacks and burdens deployment environments. The lack of a Unified Face Attack Detection (UAD) model able to simultaneously handle attacks in both categories is mainly attributed to two factors. (1) A benchmark sufficient for models to explore is lacking: existing UAD datasets contain only limited attack types and samples, confining the model's ability to address abundant advanced threats. In light of this, we propose, organized in an explainable hierarchical way, the most extensive and sophisticated collection of forgery techniques available to date, namely UniAttackDataPlus (UniAttackData+). UniAttackData+ encompasses 2,875 identities and 54 kinds of corresponding falsified samples, totaling 697,347 videos. (2) A trustworthy classification criterion is absent: current methods endeavor to find an arbitrary criterion within a single semantic space, which fails when encountering diverse attacks. Thus, we present a novel Visual-Language Model-based hierarchical prompt tuning framework that adaptively explores multiple classification criteria from different semantic spaces. Specifically, we construct a Visual Prompt Tree (VP-Tree) to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model selects the most suitable prompts to guide the encoder to extract discriminative features at different levels in a coarse-to-fine manner. Finally, to help the model understand the classification criteria in visual space, we propose a Dynamic Prompt Integration (DPI) module that projects the visual prompts into the text encoder to obtain more accurate semantics.


Key findings
HiPTune consistently outperforms various CLIP-based and specialized PAD/FFD baselines across all challenging protocols on the UniAttackData+, JFSFDB, and UniAttackData datasets. It demonstrates superior generalization and robustness against diverse, unseen, and complex attack types, achieving notably lower ACER (Average Classification Error Rate) values. Ablation studies confirm that both the hierarchical depth of the VP-Tree and the prompt length contribute significantly to the model's performance and fine-grained classification ability.
Approach
The authors address unified face attack detection by first constructing UniAttackData+, a large-scale, hierarchically organized video dataset featuring diverse physical and digital attacks. Their proposed method, HiPTune, is a Visual-Language Model-based Hierarchical Prompt Tuning Framework. It uses a Visual Prompt Tree to organize classification criteria from coarse to fine, employs an Adaptive Prompt Pruning mechanism to dynamically select the most suitable prompts for each input, and integrates them via a Dynamic Prompt Integration module for cross-modal semantic alignment.
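The coarse-to-fine selection idea behind the Visual Prompt Tree and Adaptive Prompt Pruning can be illustrated with a minimal sketch. This is not the authors' implementation: the tree layout, random "prompt" vectors, and the cosine-similarity pruning rule are all hypothetical stand-ins for learned prompt embeddings and the paper's actual pruning mechanism.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class PromptNode:
    """One node of a hypothetical visual prompt tree: a prompt embedding
    (random here, learned in practice) plus children encoding finer-grained
    classification criteria."""
    def __init__(self, name, dim=8, children=None, seed=0):
        rng = np.random.default_rng(seed)
        self.name = name
        self.prompt = rng.standard_normal(dim)  # stand-in for a learned prompt
        self.children = children or []

def prune_path(root, visual_feat):
    """Adaptive prompt pruning, sketched: descend the tree level by level,
    keeping only the child whose prompt best matches the image feature,
    so each input follows one coarse-to-fine path of criteria."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(c.prompt, visual_feat))
        path.append(node)
    return [n.name for n in path]
```

For example, a two-level tree might branch "face" into "physical" vs. "digital" and then into concrete attack families ("print", "replay", "deepfake"); `prune_path` returns the single root-to-leaf sequence of prompts chosen for a given image feature.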
Datasets
UniAttackDataPlus (UniAttackData+), JFSFDB, UniAttackData, CASIA-SURF, CASIA-SURF CeFA (CeFA), CASIA-SURF HiFiMask (HiFiMask)
Model(s)
HiPTune (Hierarchical Prompt Tuning framework), built on a frozen CLIP model with a ViT-B/16 backbone, incorporating a Visual Prompt Tree (VP-Tree), Adaptive Prompt Pruning (APP) module, and Dynamic Prompt Integration (DPI) module.
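The DPI module's role, projecting a selected visual prompt into the text encoder's space for cross-modal alignment, can be sketched as follows. All dimensions, the linear projection, the additive fusion, and the class names are illustrative assumptions, not the paper's actual architecture; a CLIP-style cosine-similarity comparison then scores each class.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_txt = 16, 12  # hypothetical visual / text embedding sizes

# Stand-ins: a visual prompt selected from the tree, and frozen text
# embeddings for two class prompts (would come from CLIP's text encoder).
visual_prompt = rng.standard_normal(d_vis)
text_classes = {"live": rng.standard_normal(d_txt),
                "attack": rng.standard_normal(d_txt)}

# DPI-style projection (sketch): a linear map carrying the visual prompt
# into the text embedding space so the two modalities can be fused.
W_proj = rng.standard_normal((d_txt, d_vis)) * 0.1
projected = W_proj @ visual_prompt

def classify(image_feat):
    """Fuse the projected visual prompt with each class text embedding
    (simple addition, an assumption) and score by cosine similarity."""
    scores = {}
    for name, t in text_classes.items():
        fused = t + projected
        scores[name] = float(image_feat @ fused /
                             (np.linalg.norm(image_feat) *
                              np.linalg.norm(fused) + 1e-8))
    return max(scores, key=scores.get)
```

In the real framework the projection would be trained jointly with the prompts while the CLIP encoders stay frozen; here it only demonstrates the data flow from visual prompt to text-space semantics.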
Author countries
China, USA, UK