Benchmarking Foundation Models for Zero-Shot Biometric Tasks

Authors: Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross

Published: 2025-05-30 04:53:55+00:00

AI Summary

This work introduces a comprehensive benchmark evaluating the zero-shot and few-shot performance of 41 state-of-the-art Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs) across six biometric tasks, including face and iris recognition, soft biometric attribute prediction, presentation attack detection, and face manipulation detection (morphs and deepfakes). The study shows that embeddings from these foundation models can be leveraged for such tasks with varying degrees of success, often achieving high accuracy without task-specific fine-tuning, which highlights their potential for biometric recognition and analysis.

Abstract

The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Faces in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.
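The headline numbers above (e.g., 96.77 percent TMR at 1 percent FMR on LFW) are operating points read off the genuine and impostor similarity-score distributions. A minimal sketch of how TMR at a fixed FMR can be computed from such scores follows; the function name and the quantile-based threshold rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.01):
    """True Match Rate at a fixed False Match Rate.

    Choose the similarity threshold at which roughly target_fmr of
    impostor comparisons would be (wrongly) accepted, then report the
    fraction of genuine comparisons accepted at that threshold.
    """
    impostor = np.asarray(impostor_scores, dtype=float)
    threshold = np.quantile(impostor, 1.0 - target_fmr)
    genuine = np.asarray(genuine_scores, dtype=float)
    return float(np.mean(genuine > threshold))
```

Lowering the target FMR pushes the threshold higher, so TMR can only stay the same or drop; that is why TMR@1%FMR is a stricter figure than plain accuracy.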


Key findings
Several foundation models, particularly OpenCLIP, BLIP, CLIP, and LLaVA, achieved remarkably strong zero-shot performance (over 90% TMR@1%FMR) in face recognition. DINO and DINOv2 variants consistently excelled in iris recognition and iris presentation attack detection, especially when paired with simple classifiers. For face deepfake and morph attack detection, InternVL3 models demonstrated superior performance, underscoring the potential of large-scale pre-trained multimodal models for diverse biometric security applications.
Approach
The researchers benchmark 41 publicly available Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs) by extracting visual embeddings from their image encoders. These embeddings are then evaluated for zero-shot and few-shot performance on six biometric tasks (face verification, soft biometrics, iris recognition, iris PAD, face morph/deepfake detection), sometimes utilizing a simple classifier head trained on the extracted features.
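The two evaluation modes described above, zero-shot comparison of frozen embeddings and a lightweight classifier head trained on top of them, can be sketched as follows. All function names, the cosine-similarity matcher, and the logistic-regression head are illustrative assumptions; the paper does not specify this exact implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two frozen-encoder embeddings."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb_a, emb_b, threshold=0.35):
    """Zero-shot verification: declare a match if the embeddings are
    similar enough. In practice the threshold is calibrated on held-out
    impostor pairs to hit the target FMR (e.g., 1 percent)."""
    return cosine_similarity(emb_a, emb_b) >= threshold

def train_linear_head(X, y, lr=0.1, epochs=500):
    """Few-shot mode: a simple logistic-regression head trained on
    frozen embeddings X (n_samples x dim) with binary labels y,
    e.g., for iris PAD or deepfake detection."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # dLoss/dlogit for log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_head(w, b, X):
    """Binary predictions from the trained head."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)
```

The key design point is that the foundation model's image encoder stays frozen in both modes; only the comparison threshold or the tiny linear head is adapted to the biometric task.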
Datasets
AgeDB, Labeled Faces in the Wild (LFW), Cross-Pose LFW (CPLFW), Celebrities in Frontal-Profile (CFP-FP), Celebrities in Frontal-Profile (CFP-FF), VGG-Face2 Mivia Ethnicity Recognition (VMER), UND-Iris-0405, IIT-Delhi-Iris (IITD-R), IIT-Delhi-Iris (IITD-P), FaceForensics++, AMSL, FRLL-Morphs (OpenCV, StyleGAN, WebMorph, FaceMorph), MorDiff, SMDD
Model(s)
41 publicly available VLMs and MLLMs, including CLIP, OpenCLIP, BLIP, LLaVA, DINO, DINOv2, and InternVL3 variants
Author countries
United States