🤖 AI Summary
This work addresses a limitation of existing sign language recognition (SLR) models: because they are trained predominantly with gloss-sequence or text supervision, they under-model non-lexical constructions such as spatial indexing, in which pointing gestures assign discourse entities to spatial loci for later co-reference. Indexing makes up 10–15% of signing content, yet a targeted evaluation shows that current SLR models recover it poorly. The authors decompose spatial reference resolution into two subtasks, index detection and discourse entity linking, and introduce a framework for training and evaluating indexing experts. The resulting mention representations support automatic annotation and the modeling of non-lexical structure, and they serve as an auxiliary indexing expert that augments a frozen SLR model at inference time. The work establishes a baseline for index-aware sign language modeling.
📝 Abstract
Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.
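The two-stage decomposition described above (index detection, then discourse entity linking) can be illustrated with a minimal sketch. This is not the authors' model: the per-frame pointing scores, the 2D pointing targets, the score threshold, and the nearest-locus linking rule are all hypothetical stand-ins chosen only to make the data flow between the two stages concrete.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class IndexMention:
    start: int      # first frame of the pointing span (inclusive)
    end: int        # last frame of the pointing span (inclusive)
    target: tuple   # mean (x, y) pointing target over the span


def detect_indices(scores, targets, threshold=0.5):
    """Stage 1 (index detection), assumed form: group consecutive frames
    whose pointing score reaches `threshold` into one mention each."""
    mentions, start = [], None
    # A -inf sentinel closes a span that runs to the final frame.
    for t, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            span = targets[start:t]
            mean = (sum(p[0] for p in span) / len(span),
                    sum(p[1] for p in span) / len(span))
            mentions.append(IndexMention(start, t - 1, mean))
            start = None
    return mentions


def link_entities(mentions, loci):
    """Stage 2 (entity linking), assumed form: assign each mention to the
    entity whose spatial locus is closest to the mention's target."""
    return [min(loci, key=lambda name: dist(loci[name], m.target))
            for m in mentions]


# Toy input: six frames, two pointing events, two established loci.
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
targets = [(0, 0), (0.9, 0.1), (1.1, -0.1), (0, 0), (-1.0, 0.0), (0, 0)]
loci = {"MOTHER": (1.0, 0.0), "BROTHER": (-1.0, 0.0)}

mentions = detect_indices(scores, targets)
links = link_entities(mentions, loci)
```

In this toy input, frames 1 to 2 and frame 4 form the two mentions, and each is linked to the entity whose locus lies nearest its pointing target. A learned indexing expert would replace both the thresholding and the distance rule; the sketch only shows how the output of detection feeds linking.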