Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

Authors: Ashutosh Chaubey, Xulang Guan, Mohammad Soleymani

Published: 2025-04-09 18:26:07+00:00

AI Summary

Face-LLaVA is a multimodal large language model for face analysis tasks, including deepfake detection. It combines a novel face-specific visual encoder with FaceInstruct-1M, a large instruction-tuning dataset, and achieves superior performance compared to existing open-source models.

Abstract

The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.


Key findings
Face-LLaVA outperforms existing open-source MLLMs on nine datasets across five face analysis tasks in a zero-shot setting. It achieves performance competitive with supervised task-specific methods and shows superior reasoning capabilities as evaluated by GPT-4.
Approach
Face-LLaVA is instruction-tuned on a new dataset, FaceInstruct-1M, containing one million image and video instruction samples. It incorporates a novel face-specific visual encoder with Face-Region Guided Cross-Attention that integrates face geometry with local visual features.
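The description above suggests a cross-attention layer in which face-region queries, derived from facial landmarks, attend over the encoder's local patch features. The PyTorch sketch below is one plausible, minimal reading of that mechanism; the class name, dimensions, and region grouping are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of a "Face-Region Guided Cross-Attention" layer, assuming
# per-region landmark embeddings serve as queries over visual patch tokens.
# All names and sizes here are hypothetical, not the authors' code.
import torch
import torch.nn as nn


class FaceRegionGuidedCrossAttention(nn.Module):
    def __init__(self, region_dim=128, visual_dim=1024, hidden_dim=1024, num_heads=8):
        super().__init__()
        # Map per-region landmark embeddings and patch features to a shared width.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Region queries gather the local visual evidence relevant to each face region.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, region_feats, patch_feats):
        # region_feats: (B, num_regions, region_dim), e.g. brow/eye/nose/mouth groups
        # patch_feats:  (B, num_patches, visual_dim) from the frozen vision encoder
        q = self.region_proj(region_feats)
        kv = self.visual_proj(patch_feats)
        attended, _ = self.cross_attn(q, kv, kv)
        # Residual connection; outputs can be appended to the usual visual tokens.
        return self.norm(q + attended)


if __name__ == "__main__":
    layer = FaceRegionGuidedCrossAttention()
    out = layer(torch.randn(2, 9, 128), torch.randn(2, 256, 1024))
    print(out.shape)  # torch.Size([2, 9, 1024])
```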
Datasets
FaceInstruct-1M (created by the authors from RAF-DB, AffectNet, DFEW, FERV39K, DISFA, BP4D, AffWild2, CelebA, MORPH II, UTK Face, FF++, Fake-AVC, EMER, MAFW, EmoVIT, MERR, FABAInstruct), DFEW, Crema-D, RAF-DB, DISFA, BP4D, FaceForensics++, MORPH II, UTK Face, CelebA
Model(s)
Face-LLaVA (a multimodal large language model with a LanguageBind vision encoder and a Vicuna-7B backbone, incorporating a novel Face-Region Landmark Projector and Face-Region Guided Cross-Attention)
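For orientation, the sketch below shows one way the listed components could be wired in a LLaVA-style connector: an assumed Face-Region Landmark Projector lifts grouped 2D landmarks to region queries, cross-attention pulls local evidence from the LanguageBind patch tokens, and a linear projector maps the combined tokens into the Vicuna-7B embedding space (hidden size 4096). The wiring, shapes, and module names are assumptions for illustration, not the released model.

```python
# Hypothetical LLaVA-style connector combining a landmark projector,
# face-region cross-attention, and a projection to the LLM hidden size.
import torch
import torch.nn as nn


class FaceRegionConnector(nn.Module):
    def __init__(self, region_dim=256, visual_dim=1024, llm_dim=4096, num_heads=8):
        super().__init__()
        # Assumed Face-Region Landmark Projector: lift 2D landmark points,
        # grouped by face region, to the visual feature width.
        self.landmark_proj = nn.Sequential(
            nn.Linear(2, region_dim), nn.GELU(), nn.Linear(region_dim, visual_dim)
        )
        self.cross_attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        self.to_llm = nn.Linear(visual_dim, llm_dim)  # LLaVA-style projector

    def forward(self, landmarks, patch_feats):
        # landmarks:   (B, num_regions, points_per_region, 2) grouped facial landmarks
        # patch_feats: (B, num_patches, visual_dim) from the LanguageBind encoder
        region_q = self.landmark_proj(landmarks).mean(dim=2)        # (B, R, visual_dim)
        face_tok, _ = self.cross_attn(region_q, patch_feats, patch_feats)
        tokens = torch.cat([patch_feats, face_tok], dim=1)          # patch + face tokens
        return self.to_llm(tokens)  # prepended to the text embeddings of Vicuna-7B


if __name__ == "__main__":
    out = FaceRegionConnector()(torch.randn(1, 9, 8, 2), torch.randn(1, 256, 1024))
    print(out.shape)  # torch.Size([1, 265, 4096])
```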
Author countries
USA