Benchmarking Foundation Models for Zero-Shot Biometric Tasks

Authors: Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross

Published: 2025-05-30 04:53:55+00:00

AI Summary

This work introduces a comprehensive benchmark evaluating the zero-shot and few-shot performance of 41 state-of-the-art Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs) across six biometric tasks, including face and iris recognition, soft biometric attribute prediction, presentation attack detection, and face manipulation detection (morphs and deepfakes). The study shows that embeddings from these foundation models can be leveraged for such tasks with varying degrees of success, often achieving high accuracy without task-specific fine-tuning, which highlights their potential for biometric recognition and analysis.

Abstract

The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Faces in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.
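The headline numbers above (e.g., 96.77 percent TMR at 1 percent FMR on LFW) are operating points read off the genuine and impostor similarity-score distributions. A minimal sketch of how TMR at a fixed FMR can be computed from such scores follows; the function name and the quantile-based threshold rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.01):
    """True Match Rate at a fixed False Match Rate.

    Choose the similarity threshold at which roughly target_fmr of
    impostor comparisons would be (wrongly) accepted, then report the
    fraction of genuine comparisons accepted at that threshold.
    """
    impostor = np.asarray(impostor_scores, dtype=float)
    threshold = np.quantile(impostor, 1.0 - target_fmr)
    genuine = np.asarray(genuine_scores, dtype=float)
    return float(np.mean(genuine > threshold))
```

Lowering the target FMR pushes the threshold higher, so TMR can only stay the same or drop; that is why TMR@1%FMR is a stricter figure than plain accuracy.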


Key findings
Several foundation models, particularly OpenCLIP, BLIP, CLIP, and LLaVA, achieved remarkably strong zero-shot performance (over 90% TMR@1%FMR) in face recognition. DINO and DINOv2 variants consistently excelled in iris recognition and iris presentation attack detection, especially when paired with simple classifiers. For face deepfake and morph attack detection, InternVL3 models demonstrated superior performance, underscoring the potential of large-scale pre-trained multimodal models for diverse biometric security applications.
Approach
The researchers benchmark 41 publicly available Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs) by extracting visual embeddings from their image encoders. These embeddings are then evaluated for zero-shot and few-shot performance on six biometric tasks (face verification, soft biometrics, iris recognition, iris PAD, face morph/deepfake detection), sometimes utilizing a simple classifier head trained on the extracted features.
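The two evaluation modes described above, zero-shot comparison of frozen embeddings and a lightweight classifier head trained on top of them, can be sketched as follows. All function names, the cosine-similarity matcher, and the logistic-regression head are illustrative assumptions; the paper does not specify this exact implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two frozen-encoder embeddings."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb_a, emb_b, threshold=0.35):
    """Zero-shot verification: declare a match if the embeddings are
    similar enough. In practice the threshold is calibrated on held-out
    impostor pairs to hit the target FMR (e.g., 1 percent)."""
    return cosine_similarity(emb_a, emb_b) >= threshold

def train_linear_head(X, y, lr=0.1, epochs=500):
    """Few-shot mode: a simple logistic-regression head trained on
    frozen embeddings X (n_samples x dim) with binary labels y,
    e.g., for iris PAD or deepfake detection."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # dLoss/dlogit for log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_head(w, b, X):
    """Binary predictions from the trained head."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)
```

The key design point is that the foundation model's image encoder stays frozen in both modes; only the comparison threshold or the tiny linear head is adapted to the biometric task.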
Datasets
AgeDB, Labeled Faces in the Wild (LFW), Cross-Pose LFW (CPLFW), Celebrities in Frontal-Profile (CFP-FP), Celebrities in Frontal-Profile (CFP-FF), VGG-Face2 Mivia Ethnicity Recognition (VMER), UND-Iris-0405, IIT-Delhi-Iris (IITD-R), IIT-Delhi-Iris (IITD-P), FaceForensics++, AMSL, FRLL-Morphs (OpenCV, StyleGAN, WebMorph, FaceMorph), MorDiff, SMDD
Model(s)
41 publicly available VLMs and MLLMs, including CLIP, OpenCLIP, BLIP, LLaVA, DINO, DINOv2, and InternVL3 variants
Author countries
United States