🤖 AI Summary
This study introduces the Visual Iconicity Challenge, a video-based benchmark that brings visual iconicity, the resemblance between a sign's form and its meaning, into the evaluation of vision-language models (VLMs). It tests 13 state-of-the-art VLMs on three tasks: phonological sign-form prediction (e.g., handshape and location), transparency judgment (inferring meaning from visual form alone), and graded iconicity rating. Methodologically, it adapts established psycholinguistic measures to authentic sign videos from Sign Language of the Netherlands, evaluates models in zero- and few-shot settings, and benchmarks them against human behavioral data, so the evaluation probes how models ground meaning in dynamic human motion rather than static context. Results show that VLMs recover some phonological detail yet remain below human performance on form prediction, and they fall far short of human baselines on transparency; only the strongest models correlate moderately with human iconicity ratings (r ≈ 0.45 for the best model). Notably, models with stronger phonological form prediction also agree better with human iconicity judgments, pointing to a shared sensitivity to visually grounded structure.
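Concretely, the graded-iconicity task reduces to correlating a model's per-sign scores with mean human ratings. Below is a minimal sketch of that step in Python; the data structures, the rating scale, and the choice of Spearman rank correlation are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the graded-iconicity evaluation step: correlate a model's
# per-sign iconicity scores with mean human ratings. The names below
# (model_scores, human_means) and the 1-7 rating scale are hypothetical;
# the benchmark's actual data format and correlation choice may differ.
from scipy.stats import spearmanr

def iconicity_agreement(model_scores: dict[str, float],
                        human_means: dict[str, float]) -> float:
    """Rank correlation between model and human iconicity ratings,
    computed over the signs that both sources rated."""
    shared = sorted(model_scores.keys() & human_means.keys())
    r, _p = spearmanr([model_scores[s] for s in shared],
                      [human_means[s] for s in shared])
    return r

# Toy example: a strong positive correlation on four invented signs.
model = {"HOUSE": 5.8, "TREE": 6.1, "IDEA": 2.4, "WEEK": 1.9}
human = {"HOUSE": 6.2, "TREE": 6.5, "IDEA": 3.0, "WEEK": 2.1}
print(f"Spearman r = {iconicity_agreement(model, human):.2f}")
```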
📝 Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
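To make the transparency task concrete, here is a hedged sketch of a zero-shot trial loop: the model sees one sign video, guesses the meaning, and the guess is scored against a gold gloss. The query_vlm callable, the prompt wording, and the exact-match scoring rule are all hypothetical stand-ins; the paper's actual prompts and scoring procedure are not given here.

```python
# Hedged sketch of zero-shot transparency evaluation: show a sign video,
# ask the VLM for the meaning, and score against the gold gloss.
# `query_vlm` is a stand-in for whatever vision-chat API a model exposes.
def transparency_accuracy(items: list[tuple[str, str]], query_vlm) -> float:
    """items: (video_path, gold_gloss) pairs for each sign in the test set."""
    hits = 0
    for video_path, gold_gloss in items:
        prompt = ("This video shows one sign from Sign Language of the "
                  "Netherlands. What does the sign mean? "
                  "Answer with a single word.")
        guess = query_vlm(video=video_path, prompt=prompt)
        # Naive exact-match scoring; the paper may use a more permissive rule
        # (e.g., accepting synonyms or lemma matches).
        hits += guess.strip().lower() == gold_gloss.lower()
    return hits / len(items)
```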