A Dataset for Dynamic Human Preferences for Vision Language Models

📅 2026-06-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current vision-language model evaluation benchmarks predominantly assess static capabilities and struggle to measure a model’s ability to adapt in real time to users’ dynamic preferences. This work proposes the first evaluation framework specifically designed for dynamic human preferences during inference, leveraging an automated data generation pipeline to construct an image-dependent multimodal preference dataset. The framework emphasizes contextual, personalized adaptation rather than reliance on generic preferences learned during training. Systematic evaluation using this benchmark reveals significant limitations of state-of-the-art vision-language models in handling such dynamic tasks, thereby establishing a new direction and providing essential data resources for future research.
📝 Abstract
Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important to evaluate how well these models can adapt to the real-time preferences of different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations in image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.
Problem

Research questions and friction points this paper is trying to address.

dynamic human preferences
Vision Language Models
in-context learning
multi-modal preference
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic human preferences
vision-language models
in-context learning
multimodal benchmark
automated dataset generation