MMPB: It's Time for Multi-Modal Personalization

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large vision-language models (VLMs) lack systematic evaluation of visual personalization, specifically user preference understanding, conversational consistency, and visual adaptation. Method: We introduce MMPB, the first benchmark for multimodal personalization in VLMs, comprising 10K image-query pairs and 111 customizable concepts spanning four categories (humans, animals, objects, and characters). We propose a structured evaluation framework with three core tasks (preference understanding, dialogue consistency, and visual alignment) and a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying, incorporating preference-grounded queries and long-context assessment. Contribution/Results: Empirical evaluation of 23 state-of-the-art VLMs reveals pervasive issues: response refusal, long-context forgetting, and vision-semantics misalignment. MMPB provides a scalable, reproducible evaluation standard and diagnostic toolkit for personalized multimodal AI.
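
To make the benchmark's structure concrete, here is a minimal sketch of how a single MMPB entry might be represented. The field names (`concept_id`, `injection_images`, `task_type`, etc.) are illustrative assumptions based on the summary above, not the dataset's actual schema.

```python
# Hypothetical schema for one MMPB benchmark entry.
# Field names are illustrative; the released dataset format may differ.
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    HUMAN = "human"          # enriched with preference-grounded queries
    ANIMAL = "animal"
    OBJECT = "object"
    CHARACTER = "character"


@dataclass
class Concept:
    concept_id: str              # one of the 111 personalizable concepts
    category: Category
    injection_images: list[str]  # reference images shown during concept injection
    preferences: list[str] = field(default_factory=list)  # user preferences (human category)


@dataclass
class QueryItem:
    concept: Concept
    image_path: str              # query image (10K image-query pairs total)
    question: str                # personalized query about the concept
    task_type: str               # "preference" | "consistency" | "visual_alignment"
```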

📝 Abstract
Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB
Problem

Research questions and friction points this paper is trying to address.

Evaluating the personalization capabilities of Vision-Language Models with a systematic benchmark
Addressing VLMs' struggles with dialogue consistency, user preferences, and visual adaptation
Providing a scalable framework for improving multi-modal AI personalization in user-facing applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed the first extensive benchmark for VLM personalization (10K image-query pairs, 111 concepts)
Structured personalization into three key task types: preference understanding, dialogue consistency, and visual alignment
Implemented a three-stage evaluation protocol: concept injection, multi-turn dialogue, and personalized querying (sketched below)
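
As a concrete reading of that protocol, the loop below walks one benchmark item through the three stages. The `vlm.new_session()` / `session.chat(...)` interface is a hypothetical stand-in for whatever model API is under test, not MMPB's actual harness.

```python
# Minimal sketch of the three-stage MMPB protocol:
#   1) concept injection, 2) multi-turn dialogue, 3) personalized querying.
# The session/chat API is a placeholder assumption, not the paper's code.

def evaluate_item(vlm, item, filler_turns):
    session = vlm.new_session()

    # Stage 1: concept injection -- introduce the concept via reference images.
    for img in item.concept.injection_images:
        session.chat(images=[img],
                     text=f"This is {item.concept.concept_id}. Remember it.")

    # Stage 2: multi-turn dialogue -- intervening turns that stress
    # long-context retention of the injected concept.
    for turn in filler_turns:
        session.chat(text=turn)

    # Stage 3: personalized querying -- ask about the injected concept
    # and score the answer downstream.
    return session.chat(images=[item.image_path], text=item.question)
```

Separating injection from querying with filler turns is what lets the benchmark surface long-context forgetting rather than only immediate recall.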
👥 Authors
Jaeik Kim (AIDAS Laboratory, IPAI & ECE, Seoul National University)
Woojin Kim (Stanford University, Economics)
Woohyeon Park (AIDAS Laboratory, IPAI & ECE, Seoul National University)
Jaeyoung Do (Department of Electrical and Computer Engineering, Seoul National University; research areas: Generative AI (LLMs), Multi-Modal AI (NLP/Vision), Big Data Systems)