FA³-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection

Authors: Yongze Li, Ning Li, Ajian Liu, Hui Ma, Liying Yang, Xihong Chen, Zhiyao Liang, Yanyan Liang, Jun Wan, Zhen Lei

Published: 2025-04-01 06:19:50+00:00

AI Summary

FA³-CLIP, a unified face attack detection model, leverages attack-agnostic prompt learning and frequency-aware cues fusion to detect both physical and digital face attacks. It achieves state-of-the-art results by incorporating frequency information to complement spatial features and learning a unified feature space for diverse attack types.

Abstract

Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to detect physical and digital attacks simultaneously due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA³-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates live/fake conditional biases from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is used in the frequency stream to reduce redundancy in frequency features while preserving the diversity of crucial cues. We also establish new, challenging protocols to facilitate the evaluation of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting both physical and digital face attacks, achieving state-of-the-art results.
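To make the frequency-aware idea concrete, the PyTorch sketch below shows one plausible way to extract frequency cues with a 2D FFT and fuse them with spatial features. It is a minimal illustration under assumed shapes and module names (FrequencyCues, DualStreamFusion), not the authors' exact architecture.

import torch
import torch.nn as nn

class FrequencyCues(nn.Module):
    # Project FFT amplitude and phase maps into a frequency feature map.
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(2 * in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.fft2(x, norm="ortho")   # complex spectrum per channel
        amp, phase = spec.abs(), spec.angle()    # real-valued magnitude / phase
        return self.proj(torch.cat([amp, phase], dim=1))

class DualStreamFusion(nn.Module):
    # Concatenate spatial and frequency features, then mix with a 1x1 conv.
    def __init__(self, spatial_ch: int = 64, freq_ch: int = 64):
        super().__init__()
        self.mix = nn.Conv2d(spatial_ch + freq_ch, spatial_ch, kernel_size=1)

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([spatial, freq], dim=1))

x = torch.randn(2, 3, 224, 224)            # batch of face crops
freq = FrequencyCues()(x)                  # (2, 64, 224, 224)
spatial = torch.randn(2, 64, 224, 224)     # placeholder spatial-stream features
fused = DualStreamFusion()(spatial, freq)  # (2, 64, 224, 224)

A simple channel-concatenation fusion like this keeps both streams intact before mixing; the paper's actual fusion framework may use a more elaborate interaction.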


Key findings
FA³-CLIP significantly outperforms existing methods in detecting both physical and digital face attacks across multiple datasets and newly established protocols. The inclusion of frequency information and attack-agnostic prompt learning proves crucial for achieving state-of-the-art performance. The ablation studies demonstrate the effectiveness of each component in the model.
Approach
FA³-CLIP uses attack-agnostic prompt learning in the language branch to generate generic live and fake prompts, improving the representation of live and fake faces. A dual-stream cues fusion framework in the vision branch combines spatial and frequency features to enhance discriminative power.
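Below is a hedged PyTorch sketch of what attack-agnostic prompt learning with a conditional bias could look like: generic learnable live/fake prompt tokens are offset by a per-sample bias predicted from the fused spatial-frequency feature, echoing the abstract's description. The bias network and all dimensions are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn

class AttackAgnosticPrompts(nn.Module):
    def __init__(self, ctx_len: int = 8, dim: int = 512):
        super().__init__()
        # two generic learnable prompts: index 0 = live, index 1 = fake
        self.generic = nn.Parameter(0.02 * torch.randn(2, ctx_len, dim))
        # small meta-network that predicts a per-sample conditional bias
        # from the pooled spatial+frequency feature (hypothetical design)
        self.bias_net = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim)
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (B, dim); returns per-sample prompts of shape (B, 2, ctx_len, dim)
        bias = self.bias_net(fused)
        return self.generic.unsqueeze(0) + bias[:, None, None, :]

prompts = AttackAgnosticPrompts()(torch.randn(4, 512))  # (4, 2, 8, 512)

The conditioned prompts would then pass through the frozen CLIP text encoder to produce live/fake class embeddings for similarity-based classification.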
Datasets
UniAttackData [13], JFSFDB [45]
Model(s)
Vision Transformer (ViT-B/16) [79] for image encoding and a pre-trained Transformer for text encoding. The architecture also includes custom modules for frequency feature extraction, fusion, and compression.
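The frequency compression block is described only at a high level; a squeeze-and-expand bottleneck with a residual connection is one way such a block could reduce channel redundancy while preserving cue diversity. The design below (FreqCompression) is an assumption, not the paper's implementation.

import torch
import torch.nn as nn

class FreqCompression(nn.Module):
    def __init__(self, ch: int = 64, ratio: int = 4):
        super().__init__()
        self.squeeze = nn.Conv2d(ch, ch // ratio, kernel_size=1)  # drop redundancy
        self.expand = nn.Conv2d(ch // ratio, ch, kernel_size=1)   # restore width
        self.act = nn.GELU()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # residual keeps the original cue diversity alongside the compressed view
        return f + self.expand(self.act(self.squeeze(f)))

out = FreqCompression()(torch.randn(2, 64, 14, 14))  # same shape as input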
Author countries
China