Selective Contrastive Learning For Gloss Free Sign Language Translation

2026-04-24
Citations: 0
Influential: 0

AI Summary
This work addresses cross-modal misalignment between sign language videos and their text translations when gloss-level annotations are unavailable. The authors propose a selective contrastive learning framework built on a dynamic analysis of negative-sample similarity, which they present as the first of its kind. Using similarity trajectories tracked over training, the method identifies semantically irrelevant or noisy negative samples, then applies a similarity-aware pair selection mechanism together with a curriculum-based mini-batch training strategy to improve the quality of contrastive supervision. Combining CLIP-style vision-language pretraining with dynamic negative sampling and curriculum learning is reported to stabilize cross-modal alignment and to improve translation accuracy and model robustness in gloss-free sign language translation.
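For reference, the CLIP-style objective mentioned above can be illustrated with a minimal sketch of a symmetric in-batch contrastive loss. This is not the authors' implementation: the function name `in_batch_contrastive_loss`, the toy similarity values, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def in_batch_contrastive_loss(sim, temperature=0.07):
    """Symmetric in-batch contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between video i and text j. The diagonal
    holds the matched pairs; every off-diagonal entry is treated as a
    negative, whether or not the two sentences actually differ in meaning.
    The temperature value is an illustrative assumption.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # cross-entropy with target index i
        return total / n

    columns = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (one_direction(sim) + one_direction(columns))


# Toy batch (hypothetical values). In `noisy`, texts 0 and 1 are the same
# sentence, so video 0 is as similar to text 1 as to its own text, yet the
# loss still counts that pair as a negative.
clean = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 0.9]]
noisy = [[0.9, 0.9, 0.0], [0.9, 0.9, 0.0], [0.0, 0.0, 0.9]]
loss_clean = in_batch_contrastive_loss(clean)
loss_noisy = in_batch_contrastive_loss(noisy)
```

In this toy batch the duplicated sentence raises the loss, because each matched pair has to compete with an equally similar "negative". That is one way to picture the noisy supervision the summary refers to.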

Abstract
Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.
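The Pair Selection idea described above can be pictured with a small sketch. The abstract does not give the scoring rule or the curriculum schedule, so everything here is an assumption made for illustration: the helper names `pushed_away` and `select_negatives`, the use of strictly decreasing similarity as the validity test, the use of last-checkpoint similarity as a hardness proxy, and the linear stage schedule.

```python
import math


def pushed_away(trajectory):
    """True if similarity strictly decreases across reference checkpoints."""
    return all(later < earlier
               for earlier, later in zip(trajectory, trajectory[1:]))


def select_negatives(trajectories, stage, num_stages, k):
    """Pick up to k negatives for one anchor at a curriculum stage (k >= 1).

    trajectories maps a candidate id to its video-text similarity at each
    reference checkpoint. Both the filter and the hardness proxy below are
    hypothetical stand-ins, not the paper's actual scoring function.
    """
    # Keep only candidates whose similarity is consistently pushed down;
    # rising or erratic trajectories may indicate semantically invalid negatives.
    valid = [cid for cid, traj in trajectories.items() if pushed_away(traj)]
    # Assumed hardness proxy: similarity at the last reference checkpoint.
    valid.sort(key=lambda cid: trajectories[cid][-1])
    # Curriculum: the admissible pool grows from the easiest candidates to all.
    pool_size = max(k, math.ceil(len(valid) * (stage + 1) / num_stages))
    pool = valid[:pool_size]
    return pool[-k:]  # the k hardest candidates admitted so far


# Hypothetical similarity trajectories for six candidate negatives.
trajectories = {
    "t1": [0.30, 0.20, 0.10],
    "t2": [0.50, 0.40, 0.30],
    "t3": [0.70, 0.60, 0.50],
    "t4": [0.80, 0.60, 0.55],
    "t5": [0.60, 0.65, 0.70],  # rising: possibly a false negative
    "t6": [0.40, 0.45, 0.40],  # non-monotonic: filtered out in this sketch
}
early = select_negatives(trajectories, stage=0, num_stages=2, k=2)
late = select_negatives(trajectories, stage=1, num_stages=2, k=2)
```

With these toy values, the early stage draws the two easiest valid negatives and the late stage draws the two hardest, while the rising and non-monotonic candidates are never selected. A real system would compute the trajectories from saved model checkpoints and would likely use a softer score than a strict monotonicity test.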
Problem

Research questions and friction points this paper is trying to address.

Sign Language Translation
Gloss-Free
Contrastive Learning
Vision-Language Pretraining
Modality Mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Contrastive Learning
Pair Selection
Gloss-Free Sign Language Translation
Vision-Language Pretraining
Curriculum Negative Mining
Authors
Changhao Lai
School of Informatics, Xiamen University, China; Key Lab of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian-Taiwan (XMU), Ministry of Culture and Tourism, China; National Language Resources Monitoring and Research Center for Education and Teaching Media, Xiamen University, China
Rui Zhao
National University of Singapore
Computer Vision, Multimodal, Vision and Language, Virtual Humans, Remote Sensing
Xuewen Zhong
School of Informatics, Xiamen University, China; Key Lab of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian-Taiwan (XMU), Ministry of Culture and Tourism, China; National Language Resources Monitoring and Research Center for Education and Teaching Media, Xiamen University, China
Jinsong Su
Xiamen University
Natural Language Processing, Deep Learning, Neural Machine Translation
Yidong Chen
Xiamen University
Computer Vision, 3D Point Cloud Localization, 3D Object Detection, Deep learning