CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on Chinese sign language understanding, particularly regarding authority, multimodal alignment, and articulatory diversity. To bridge this gap, we introduce CNSL-bench, the first multimodal benchmark for Chinese sign language grounded in the National Common Sign Language Dictionary, integrating text, images, and sign language videos that encompass diverse articulation forms such as air writing, fingerspelling, and Chinese manual alphabet signs. Leveraging fine-grained action categorization and multimodal alignment techniques, we construct a structured evaluation suite. Benchmarking 21 prominent MLLMs reveals that current models substantially underperform human-level comprehension of Chinese sign language, exhibit significant performance disparities across modalities and articulation types, and show considerable inter-model variation in robustness to instruction following.

📝 Abstract
Sign language research has achieved significant progress owing to advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual alphabet. Using CNSL-bench, we extensively evaluate 21 up-to-date open-source and proprietary MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
Problem

Research questions and friction points this paper is trying to address.

sign language understanding
multimodal large language models
Chinese National Sign Language
benchmarking
manual articulatory forms
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNSL-bench
sign language understanding
multimodal large language models
Chinese National Sign Language
articulatory diversity
Rui Zhao
National University of Singapore
Computer Vision, Multimodal, Vision and Language, Virtual Humans, Remote Sensing
Xuewen Zhong
School of Informatics, Xiamen University, China; Key Lab of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian-Taiwan (XMU), Ministry of Culture and Tourism, China; National Language Resources Monitoring and Research Center for Education and Teaching Media, Xiamen University, China
Xiaoyun Zheng
School of Informatics, Xiamen University, China; Key Lab of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian-Taiwan (XMU), Ministry of Culture and Tourism, China; National Language Resources Monitoring and Research Center for Education and Teaching Media, Xiamen University, China
Jinsong Su
Xiamen University
Natural Language Processing, Deep Learning, Neural Machine Translation
Yidong Chen
Xiamen University
Computer Vision, 3D Point Cloud Localization, 3D Object Detection, Deep Learning