Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of a unified multimodal large language model (MLLM) for facial understanding, this paper introduces Face-LLaVA, an MLLM specifically designed for face analysis. Methodologically, the authors propose a visual encoder powered by Face-Region Guided Cross-Attention, construct FaceInstruct-1M (a million-scale facial instruction-tuning dataset), and develop a multi-task evaluation protocol coupled with a GPT-driven automated reasoning benchmark. In terms of contributions and results, Face-LLaVA achieves state-of-the-art performance among open-source MLLMs across five core facial understanding tasks (facial expression recognition, facial attribute detection, age estimation, action unit detection, and deepfake detection) on nine established benchmarks. Under zero-shot settings, its outputs receive higher GPT reasoning ratings than open-source alternatives and are competitive with commercial solutions. Both the Face-LLaVA model and the FaceInstruct-1M dataset are publicly released to foster community advancement.

📝 Abstract
The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.
Problem

Research questions and friction points this paper is trying to address.

Develop a multimodal model for facial expression and attribute recognition
Create a face-centered database for instruction tuning MLLMs
Improve face processing tasks with a novel visual encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal large language model for face-centered learning
Face-Region Guided Cross-Attention visual encoder
FaceInstruct-1M database for instruction tuning
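The Face-Region Guided Cross-Attention idea above can be illustrated in miniature: features derived from face regions (e.g. landmark groups such as eyes, brows, nose, mouth, jaw) act as queries attending over the visual encoder's patch tokens. The following numpy sketch uses random placeholder projections; all dimensions, the five-region split, and function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def face_region_cross_attention(region_feats, patch_feats, d_k=64, seed=0):
    """Toy cross-attention: face-region features (e.g. landmark-derived)
    are queries; visual patch tokens supply keys and values.
    Projection weights are random placeholders, not learned."""
    rng = np.random.default_rng(seed)
    d_r, d_p = region_feats.shape[-1], patch_feats.shape[-1]
    W_q = rng.standard_normal((d_r, d_k)) / np.sqrt(d_r)
    W_k = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    W_v = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    Q = region_feats @ W_q                      # (n_regions, d_k)
    K = patch_feats @ W_k                       # (n_patches, d_k)
    V = patch_feats @ W_v                       # (n_patches, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))      # (n_regions, n_patches)
    return attn @ V                             # (n_regions, d_k)

# 5 hypothetical face regions attending over 196 ViT-style patch tokens.
regions = np.random.default_rng(1).standard_normal((5, 32))
patches = np.random.default_rng(2).standard_normal((196, 768))
out = face_region_cross_attention(regions, patches)
print(out.shape)  # (5, 64)
```

The output is one attended feature per face region, which a model of this kind could then fuse with the global visual tokens before the language model.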
Ashutosh Chaubey
CS PhD, University of Southern California
Computer Vision · Multimodal AI · Speech Processing
Xulang Guan
Institute for Creative Technologies, University of Southern California
Mohammad Soleymani
Institute for Creative Technologies, University of Southern California