Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

📅 2026-05-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current evaluations of vision-language models (VLMs) on urban perception tasks tend to overlook three features of human annotation: inter-annotator disagreement, abstention, and the negotiability of the label space. This limits their usefulness for urban governance decisions. The paper argues for a reliability-aware, negotiable approach to evaluation that reports inter-annotator reliability alongside model alignment, treats disagreement and abstention as measurement outcomes, and treats the label space and scoring policy as negotiable artifacts. The argument is grounded in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, together with a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability. For the appraisal dimension Overall Impression, models and annotators show a distributional mismatch, including different rates of "Not applicable" responses.
📝 Abstract
Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression, models and annotators exhibit distributional mismatch, including different rates of "Not applicable". We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.
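To make the reporting idea concrete, here is a minimal Python sketch of a per-dimension report that places human reliability next to model alignment and keeps "Not applicable" as an explicit outcome. It is an illustration under stated assumptions, not the paper's method: the abstract does not name its reliability or alignment metrics, so mean pairwise agreement, majority-vote consensus, and total variation distance are stand-ins, and all function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

NA = "Not applicable"  # abstention is kept as a label, not filtered out


def pairwise_agreement(items):
    """Mean over images of the share of annotator pairs giving the same label.

    Assumption: a simple stand-in for an inter-annotator reliability statistic.
    """
    per_item = []
    for labels in items:
        pairs = list(combinations(labels, 2))
        if pairs:
            per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item) if per_item else float("nan")


def label_distribution(labels):
    """Relative frequency of each label in a flat list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def report_dimension(human, model):
    """Summarize one dimension.

    human: list (one entry per image) of lists of annotator labels.
    model: list of model labels, one per image, in the same order.
    """
    # Majority vote as "consensus" (assumption); ties go to the first-seen label.
    consensus = [Counter(labels).most_common(1)[0][0] for labels in human]
    pooled = [label for labels in human for label in labels]
    human_dist = label_distribution(pooled)
    model_dist = label_distribution(model)
    return {
        "human_pairwise_agreement": pairwise_agreement(human),
        "model_consensus_agreement": sum(
            m == c for m, c in zip(model, consensus)
        ) / len(model),
        "human_na_rate": human_dist.get(NA, 0.0),
        "model_na_rate": model_dist.get(NA, 0.0),
        "distribution_tv_distance": total_variation(human_dist, model_dist),
    }


if __name__ == "__main__":
    # Toy data for a single dimension: 4 images, 3 annotators each (invented).
    human = [
        ["Positive", "Positive", "Negative"],
        [NA, "Positive", NA],
        ["Negative", "Negative", "Negative"],
        ["Positive", "Positive", NA],
    ]
    model = ["Positive", "Positive", "Negative", "Positive"]
    for name, value in report_dimension(human, model).items():
        print(f"{name}: {value:.3f}")
```

Reporting these figures side by side lets a reader see whether low model agreement on a dimension coincides with low human agreement, and whether the model abstains at a different rate than annotators do. A chance-corrected statistic could replace the pairwise agreement used here if the benchmark calls for one.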
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Urban Perception
Benchmarking
Inter-annotator Reliability
Label Disagreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Urban Perception
Inter-annotator Reliability
Benchmarking
Abstention-aware Evaluation