🤖 AI Summary
This study addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on Chinese sign language understanding, particularly with respect to authoritative grounding, multimodal alignment, and articulatory diversity. To bridge this gap, we introduce CNSL-bench, the first multimodal benchmark for Chinese sign language grounded in the National Common Sign Language Dictionary. It integrates text, images, and sign language videos that cover diverse articulation forms such as air-writing, finger-spelling, and Chinese manual alphabet signs. Leveraging fine-grained action categorization and multimodal alignment techniques, we construct a structured evaluation suite. Benchmarking 21 prominent MLLMs reveals that current models fall substantially short of human-level comprehension of Chinese sign language, exhibit significant performance disparities across modalities and articulation types, and vary considerably in the robustness of their instruction following.
📝 Abstract
Sign language research has achieved significant progress thanks to advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual alphabet. Using CNSL-bench, we extensively evaluate 21 up-to-date open-source and proprietary MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance and exhibit systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist despite improvements in reasoning and that instruction-following robustness varies substantially across models.