🤖 AI Summary
This study investigates whether vision-language models (VLMs) exhibit human-like systematic biases in perceiving surface slant from texture. Using psychophysical paradigms with zero-shot and in-context prompting, the authors evaluate slant estimation across multiple VLM families and model scales, complemented by supervised fine-tuning. The work reports a pronounced anchoring effect on this low-level geometric task: models output predictions restricted to a few fixed angles (e.g., 0°, ±25°, ±45°), with little sensitivity to variations in field of view, optical slant, or surface curvature. Fine-tuning partially mitigates the anchoring but does not eliminate it. The authors interpret this as a limitation at the interface between visual representations and language output, rather than necessarily an absence of geometric encoding.
📝 Abstract
Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.
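The anchoring described above can be made concrete with a small sketch. This is not the authors' metric; it is one plausible way to quantify how strongly a set of slant predictions clusters on a few fixed angles. The anchor set is taken from the examples in the abstract, while the tolerance, the function names, and the response lists are assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical anchor set, taken from the example angles in the abstract.
ANCHORS = (-45.0, -25.0, 0.0, 25.0, 45.0)


def anchoring_fraction(predictions, anchors=ANCHORS, tol=1.0):
    """Fraction of predictions lying within `tol` degrees of any anchor.

    `tol` is an assumed tolerance, not a value from the paper.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    hits = sum(
        1 for p in predictions if any(abs(p - a) <= tol for a in anchors)
    )
    return hits / len(predictions)


def distinct_outputs(predictions, ndigits=0):
    """Number of distinct slant values produced, after rounding."""
    return len(Counter(round(p, ndigits) for p in predictions))


# Made-up responses for illustration only; these are not data from the paper.
graded = [-40.0, -31.0, -18.0, -7.0, 4.0, 12.0, 23.0, 37.0]
anchored = [-45.0, -25.0, -25.0, 0.0, 0.0, 0.0, 25.0, 45.0]

print(anchoring_fraction(graded), distinct_outputs(graded))
print(anchoring_fraction(anchored), distinct_outputs(anchored))
```

Under these assumptions, a graded responder spreads its outputs across many values and scores a low anchoring fraction, whereas an anchored responder concentrates on a handful of values regardless of the stimulus. A fuller analysis would also relate predictions to the true stimulus slant, which this sketch omits.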