SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sign language translation methods often neglect non-manual cues such as lip movements, leading to ambiguity between visually similar signs. To address this, we propose SignClip, the first framework to systematically integrate spatial hand gesture and dynamic lip motion features via a hierarchical contrastive learning mechanism, enabling fine-grained semantic alignment across manual and non-manual modalities. Our approach employs a shared visual backbone to extract dual-modality features, incorporates multi-level contrastive fusion, and leverages large language models to enhance cross-modal understanding. Evaluated on PHOENIX14T (gloss-free setting) and How2Sign, SignClip achieves new state-of-the-art results, including a BLEU-4 of 24.71 and a ROUGE of 48.38 on PHOENIX14T, demonstrating substantial improvements in disambiguating visually confusable signs through effective integration of non-manual cues.
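
The multi-level alignment objectives above are named but not specified. As a minimal sketch, assuming symmetric InfoNCE losses over paired sign-lip and visual-text embeddings (function names, shapes, and loss weights below are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs lie on the diagonal; other batch entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(sign_emb, lip_emb, visual_emb, text_emb,
                                  w_sign_lip: float = 1.0, w_vis_text: float = 1.0):
    """Hypothetical two-level objective: sign-lip alignment plus visual-text alignment."""
    return (w_sign_lip * info_nce(sign_emb, lip_emb)
            + w_vis_text * info_nce(visual_emb, text_emb))
```

One plausible reading of this hierarchy: aligning the two visual streams (sign-lip) before aligning the fused visual representation with text is what lets mouthing cues sharpen the representation of otherwise confusable manual signs.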

📝 Abstract
Sign language translation (SLT) aims to translate sign language videos into natural language, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches focus mainly on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. In addition, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T in the gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38.
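
For context on the reported numbers, BLEU-4 and ROUGE (typically ROUGE-L in SLT work) are standard text-overlap metrics. A minimal sketch of how such scores are commonly computed with the sacrebleu and rouge-score packages, on hypothetical hypothesis/reference strings (this is not the paper's evaluation code):

```python
import sacrebleu
from rouge_score import rouge_scorer

hyps = ["the weather will be sunny tomorrow"]   # hypothetical model translations
refs = ["tomorrow the weather will be sunny"]   # hypothetical ground-truth sentences

# Corpus-level BLEU; sacrebleu defaults to 4-gram BLEU, i.e. BLEU-4.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU-4: {bleu.score:.2f}")

# ROUGE-L F1, averaged over the corpus (rouge_scorer scores one pair at a time).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(scorer.score(r, h)["rougeL"].fmeasure for r, h in zip(refs, hyps)) / len(hyps)
print(f"ROUGE-L: {100 * rouge_l:.2f}")
```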
Problem

Research questions and friction points this paper is trying to address.

Improving sign language translation accuracy by integrating manual and non-manual cues
Addressing ambiguity in visually similar signs through lip movement features
Ensuring semantic consistency across sign-lip and visual-text modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of manual and mouthing cues (see the sketch after this list)
Hierarchical contrastive learning with multi-level alignment
Lip movement features integrated for disambiguation
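
The list above names the fusion mechanism without detail. A minimal sketch of one common realization, cross-attention in which the manual (hand) stream queries the lip stream, with all module names and dimensions assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class SignLipFusion(nn.Module):
    """Illustrative cross-attention fusion of hand and lip feature sequences."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_feats: torch.Tensor, lip_feats: torch.Tensor) -> torch.Tensor:
        # hand_feats, lip_feats: (batch, frames, dim), e.g. from a shared visual backbone.
        # The hand stream attends to the lip stream, injecting mouthing cues that can
        # disambiguate manual signs that look alike.
        fused, _ = self.attn(query=hand_feats, key=lip_feats, value=lip_feats)
        return self.norm(hand_feats + fused)    # residual connection keeps the manual signal

# Usage: fuse per-frame features for a 2-clip batch before a language-model decoder.
hand = torch.randn(2, 32, 512)
lip = torch.randn(2, 32, 512)
out = SignLipFusion()(hand, lip)                # -> (2, 32, 512)
```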