Vision Language Models as Values Detectors

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the capability of vision-language models (VLMs) to detect implicit value-laden elements in images, such as cues related to safety, care, and privacy, and assesses their alignment with human value judgments. Using a benchmark of 12 domestic-scene images annotated by 14 human raters, the authors evaluate GPT-4o and four LLaVA variants on value-perception tasks. The paper frames VLMs as "value detectors," using prompt engineering and comparison against human annotations to quantify value sensitivity. Results show that LLaVA 34B achieves the highest performance, yet overall model–human alignment remains limited. Notably, all models demonstrate non-trivial sensitivity to value-relevant visual and contextual cues, suggesting potential utility in socially aware robotics, assistive technologies, and interpretable human–AI interaction. The work provides methodology and empirical evidence for integrating value awareness into multimodal AI systems.
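
To make the setup concrete, below is a minimal sketch of how a VLM could be queried as a "value detector" over one benchmark image. It assumes the OpenAI Python client and GPT-4o; the prompt wording and helper names are illustrative guesses, not the paper's exact protocol.

```python
# Minimal sketch: querying GPT-4o as a "value detector" on a single image.
# The prompt text and function names are illustrative, not the paper's exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Look at this domestic scene and name the single most relevant element, "
    "considering values such as safety, care, and privacy. "
    "Answer with one short noun phrase."
)

def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be sent inline to the chat API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def detect_value_element(image_path: str) -> str:
    """Ask the model which value-laden element stands out in the image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
                }},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

The same prompt could be sent to local LLaVA checkpoints through their respective inference APIs, keeping the instruction fixed so that only the model varies.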

📝 Abstract
Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.
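
As a rough illustration of how the comparison with human responses could be scored, the sketch below matches the model's named element for each image against the fourteen annotators' answers and reports the fraction of annotators in agreement. Exact string matching is an assumption made here for brevity; the paper's actual metric may tolerate paraphrases.

```python
# Hedged sketch of scoring model-human alignment on one image: compare the
# model's answer with each annotator's answer. Normalized exact matching is
# an assumption; a softer similarity measure may better reflect the paper.

def normalize(label: str) -> str:
    """Lowercase and strip a free-text label for naive matching."""
    return label.strip().lower().rstrip(".")

def alignment_score(model_answer: str, annotator_answers: list[str]) -> float:
    """Fraction of annotators whose answer matches the model's answer."""
    target = normalize(model_answer)
    matches = sum(1 for a in annotator_answers if normalize(a) == target)
    return matches / len(annotator_answers)

# Example: a model answering "Knife" agrees with 3 of 4 annotators here.
print(alignment_score("Knife", ["knife", "knife.", "stove", "Knife"]))  # 0.75
```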
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Image Element Recognition
Human-Machine Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Visual Understanding
Human-like Perception