AI Summary
Current vision-language model evaluation benchmarks predominantly assess static capabilities and struggle to measure a model's ability to adapt in real time to users' dynamic preferences. This work proposes the first evaluation framework specifically designed for dynamic human preferences during inference, leveraging an automated data generation pipeline to construct an image-dependent multimodal preference dataset. The framework emphasizes contextual, personalized adaptation rather than reliance on generic preferences learned during training. Systematic evaluation using this benchmark reveals significant limitations of state-of-the-art vision-language models in handling such dynamic tasks, thereby establishing a new direction and providing essential data resources for future research.
Abstract
Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important to evaluate how well these models can adapt to the real-time preferences of different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations on image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.