Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning

Authors: Ajian Liu, Haocheng Yuan, Xiao Guo, Hui Ma, Wanyi Zhuang, Changtao Miao, Yan Hong, Chuanbiao Song, Jun Lan, Qi Chu, Tao Gong, Yanyan Liang, Weiqiang Wang, Jun Wan, Xiaoming Liu, Zhen Lei

Published: 2025-05-19 16:35:45+00:00

AI Summary

This paper introduces UniAttackDataPlus (UniAttackData+), the most extensive and sophisticated dataset to date for Unified Face Attack Detection (UAD), encompassing 54 types of physical and digital attacks across 697,347 videos. It also proposes HiPTune, a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework. HiPTune adaptively explores multiple classification criteria from different semantic spaces using a Visual Prompt Tree, adaptive prompt pruning, and dynamic prompt integration to enhance detection robustness.

Abstract

Presentation Attack Detection (PAD) and Face Forgery Detection (FFD) are proposed to protect face data from physical media-based presentation attacks and digital editing-based DeepFakes, respectively. However, isolated training of these two models significantly increases vulnerability to unknown attacks and burdens deployment environments. The lack of a Unified Face Attack Detection (UAD) model able to simultaneously handle attacks in both categories is mainly attributed to two factors. (1) A benchmark sufficient for models to explore is lacking: existing UAD datasets contain only limited attack types and samples, confining the model's ability to address abundant advanced threats. In light of this, we propose, organized in an explainable hierarchical way, the most extensive and sophisticated collection of forgery techniques available to date, namely UniAttackDataPlus (UniAttackData+). UniAttackData+ encompasses 2,875 identities and 54 kinds of corresponding falsified samples, totaling 697,347 videos. (2) A trustworthy classification criterion is absent: current methods endeavor to find an arbitrary criterion within a single semantic space, which fails when encountering diverse attacks. Thus, we present a novel Visual-Language Model-based hierarchical prompt tuning framework that adaptively explores multiple classification criteria from different semantic spaces. Specifically, we construct a Visual Prompt Tree (VP-Tree) to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model selects the most suitable prompts to guide the encoder to extract discriminative features at different levels in a coarse-to-fine manner. Finally, to help the model understand the classification criteria in visual space, we propose a Dynamic Prompt Integration (DPI) module that projects the visual prompts into the text encoder to obtain more accurate semantics.


Key findings
HiPTune consistently outperforms various CLIP-based and specialized PAD/FFD baselines across all challenging protocols on the UniAttackData+, JFSFDB, and UniAttackData datasets. It demonstrates superior generalization and robustness against diverse, unseen, and complex attack types, achieving notably lower ACER (Average Classification Error Rate) values. Ablation studies confirm that both the hierarchical depth of the VP-Tree and the prompt length contribute significantly to the model's performance and fine-grained classification ability.
Approach
The authors address unified face attack detection by first constructing UniAttackData+, a large-scale, hierarchically organized video dataset featuring diverse physical and digital attacks. Their proposed method, HiPTune, is a Visual-Language Model-based Hierarchical Prompt Tuning Framework. It uses a Visual Prompt Tree to organize classification criteria from coarse to fine, employs an Adaptive Prompt Pruning mechanism to dynamically select the most suitable prompts for each input, and integrates them via a Dynamic Prompt Integration module for cross-modal semantic alignment.
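The coarse-to-fine selection idea behind the Visual Prompt Tree and Adaptive Prompt Pruning can be illustrated with a minimal sketch. This is not the authors' implementation: the tree layout, random "prompt" vectors, and the cosine-similarity pruning rule are all hypothetical stand-ins for learned prompt embeddings and the paper's actual pruning mechanism.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class PromptNode:
    """One node of a hypothetical visual prompt tree: a prompt embedding
    (random here, learned in practice) plus children encoding finer-grained
    classification criteria."""
    def __init__(self, name, dim=8, children=None, seed=0):
        rng = np.random.default_rng(seed)
        self.name = name
        self.prompt = rng.standard_normal(dim)  # stand-in for a learned prompt
        self.children = children or []

def prune_path(root, visual_feat):
    """Adaptive prompt pruning, sketched: descend the tree level by level,
    keeping only the child whose prompt best matches the image feature,
    so each input follows one coarse-to-fine path of criteria."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(c.prompt, visual_feat))
        path.append(node)
    return [n.name for n in path]
```

For example, a two-level tree might branch "face" into "physical" vs. "digital" and then into concrete attack families ("print", "replay", "deepfake"); `prune_path` returns the single root-to-leaf sequence of prompts chosen for a given image feature.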
Datasets
UniAttackDataPlus (UniAttackData+), JFSFDB, UniAttackData, CASIA-SURF, CASIA-SURF CeFA (CeFA), CASIA-SURF HiFiMask (HiFiMask)
Model(s)
HiPTune (Hierarchical Prompt Tuning framework), built on a frozen CLIP model with a ViT-B/16 backbone, incorporating a Visual Prompt Tree (VP-Tree), Adaptive Prompt Pruning (APP) module, and Dynamic Prompt Integration (DPI) module.
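The DPI module's role, projecting a selected visual prompt into the text encoder's space for cross-modal alignment, can be sketched as follows. All dimensions, the linear projection, the additive fusion, and the class names are illustrative assumptions, not the paper's actual architecture; a CLIP-style cosine-similarity comparison then scores each class.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_txt = 16, 12  # hypothetical visual / text embedding sizes

# Stand-ins: a visual prompt selected from the tree, and frozen text
# embeddings for two class prompts (would come from CLIP's text encoder).
visual_prompt = rng.standard_normal(d_vis)
text_classes = {"live": rng.standard_normal(d_txt),
                "attack": rng.standard_normal(d_txt)}

# DPI-style projection (sketch): a linear map carrying the visual prompt
# into the text embedding space so the two modalities can be fused.
W_proj = rng.standard_normal((d_txt, d_vis)) * 0.1
projected = W_proj @ visual_prompt

def classify(image_feat):
    """Fuse the projected visual prompt with each class text embedding
    (simple addition, an assumption) and score by cosine similarity."""
    scores = {}
    for name, t in text_classes.items():
        fused = t + projected
        scores[name] = float(image_feat @ fused /
                             (np.linalg.norm(image_feat) *
                              np.linalg.norm(fused) + 1e-8))
    return max(scores, key=scores.get)
```

In the real framework the projection would be trained jointly with the prompts while the CLIP encoders stay frozen; here it only demonstrates the data flow from visual prompt to text-space semantics.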
Author countries
China, USA, UK