🤖 AI Summary
This study introduces the Visual Iconicity Challenge, a video-based benchmark that brings visual iconicity, the resemblance between a sign's form and its meaning, into the evaluation of vision-language models (VLMs). It tests 13 state-of-the-art VLMs on three tasks: phonological sign-form prediction (e.g., handshape and location), transparency judgment (inferring meaning from visual form alone), and graded iconicity rating. Methodologically, it adapts established psycholinguistic measures to authentic sign videos from Sign Language of the Netherlands, evaluates models in zero- and few-shot settings, and benchmarks them against human behavioral data, so the evaluation probes how models ground meaning in dynamic human motion rather than static context. Results show that VLMs recover some phonological detail yet remain below human performance on form prediction, and they fall far short of human baselines on transparency; only the strongest models correlate moderately with human iconicity ratings (r ≈ 0.45 for the best model). Notably, models with stronger phonological form prediction also agree better with human iconicity judgments, pointing to a shared sensitivity to visually grounded structure.
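Concretely, the graded-iconicity task reduces to correlating a model's per-sign scores with mean human ratings. Below is a minimal sketch of that step in Python; the data structures, the rating scale, and the choice of Spearman rank correlation are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the graded-iconicity evaluation step: correlate a model's
# per-sign iconicity scores with mean human ratings. The names below
# (model_scores, human_means) and the 1-7 rating scale are hypothetical;
# the benchmark's actual data format and correlation choice may differ.
from scipy.stats import spearmanr

def iconicity_agreement(model_scores: dict[str, float],
                        human_means: dict[str, float]) -> float:
    """Rank correlation between model and human iconicity ratings,
    computed over the signs that both sources rated."""
    shared = sorted(model_scores.keys() & human_means.keys())
    r, _p = spearmanr([model_scores[s] for s in shared],
                      [human_means[s] for s in shared])
    return r

# Toy example: a strong positive correlation on four invented signs.
model = {"HOUSE": 5.8, "TREE": 6.1, "IDEA": 2.4, "WEEK": 1.9}
human = {"HOUSE": 6.2, "TREE": 6.5, "IDEA": 3.0, "WEEK": 2.1}
print(f"Spearman r = {iconicity_agreement(model, human):.2f}")
```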
📝 Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
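To make the transparency task concrete, here is a hedged sketch of a zero-shot trial loop: the model sees one sign video, guesses the meaning, and the guess is scored against a gold gloss. The query_vlm callable, the prompt wording, and the exact-match scoring rule are all hypothetical stand-ins; the paper's actual prompts and scoring procedure are not given here.

```python
# Hedged sketch of zero-shot transparency evaluation: show a sign video,
# ask the VLM for the meaning, and score against the gold gloss.
# `query_vlm` is a stand-in for whatever vision-chat API a model exposes.
def transparency_accuracy(items: list[tuple[str, str]], query_vlm) -> float:
    """items: (video_path, gold_gloss) pairs for each sign in the test set."""
    hits = 0
    for video_path, gold_gloss in items:
        prompt = ("This video shows one sign from Sign Language of the "
                  "Netherlands. What does the sign mean? "
                  "Answer with a single word.")
        guess = query_vlm(video=video_path, prompt=prompt)
        # Naive exact-match scoring; the paper may use a more permissive rule
        # (e.g., accepting synonyms or lemma matches).
        hits += guess.strip().lower() == gold_gloss.lower()
    return hits / len(items)
```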