🤖 AI Summary
Sign language processing (SLP) research has long been hampered by complex, ad-hoc code, which lowers reproducibility and makes comparisons between works unfair — a problem confirmed by a survey the authors conducted among SLP researchers. Existing tools for fast, reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. To address this, the authors introduce MultimodalHugs, a framework built on top of Hugging Face that supports more diverse data modalities and tasks while inheriting the well-known advantages of the Hugging Face ecosystem. Although sign languages are the primary focus, MultimodalHugs adds a layer of abstraction that makes it applicable to other use cases that do not fit one of Hugging Face's standard templates. Quantitative experiments illustrate how the framework accommodates heterogeneous inputs such as pose estimation data for sign languages and pixel data for text characters.
📝 Abstract
In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers.
To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters.
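To make the idea of a modality abstraction layer concrete, here is a minimal sketch of how heterogeneous inputs (pose sequences, pixel data, text) could be routed through one uniform interface. All class, field, and function names below are illustrative assumptions for this post, not the actual MultimodalHugs API.

```python
# Hypothetical sketch of a modality abstraction layer; names are
# illustrative and do NOT reflect the real MultimodalHugs interface.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MultimodalSample:
    """A training example whose input may come from different modalities."""
    text: Optional[str] = None                  # e.g. a transcription or gloss
    pose: Optional[List[List[float]]] = None    # frames of keypoint coordinates
    pixels: Optional[List[List[int]]] = None    # e.g. rendered text characters


def detect_modality(sample: MultimodalSample) -> str:
    """Route a sample to the appropriate preprocessor based on which field is set."""
    if sample.pose is not None:
        return "pose"
    if sample.pixels is not None:
        return "pixels"
    if sample.text is not None:
        return "text"
    raise ValueError("empty sample: no modality present")


# Usage: samples from different modalities flow through one uniform interface,
# so downstream training code does not need modality-specific branches.
batch = [
    MultimodalSample(pose=[[0.1, 0.2], [0.3, 0.4]]),
    MultimodalSample(pixels=[[0, 255], [255, 0]]),
    MultimodalSample(text="HELLO"),
]
modalities = [detect_modality(s) for s in batch]
print(modalities)  # → ['pose', 'pixels', 'text']
```

The design point this sketch tries to capture is that preprocessing is selected per sample, so a single dataset and training pipeline can mix pose-based and pixel-based examples without separate code paths.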