🤖 AI Summary
Sign Language Translation (SLT) has long relied on costly, incomplete manual gloss annotations, hindering effective modeling of the spatiotemporal complexity inherent in continuous sign language. To address this, we propose an end-to-end, gloss-free SLT framework. Our method employs a dual-visual-encoder architecture to extract complementary frame-level and segment-level visual representations, and leverages contrastive learning to jointly align these visual features with sentence-level text embeddings, implicitly establishing cross-modal semantic correspondences during pretraining. Subsequently, an encoder-decoder architecture performs direct sign-to-text translation without explicit lexical supervision. This design significantly improves modeling of long-range dependencies and semantic coherence in continuous signing. Evaluated on the Phoenix-2014T benchmark, our model achieves a BLEU-4 score of 31.2, the highest among gloss-free SLT approaches.
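The contrastive pretraining objective described above can be sketched as a symmetric InfoNCE loss, computed pairwise between each visual stream and the sentence-level text embeddings. This is a minimal NumPy illustration under assumptions of our own: the exact loss formulation, temperature, and how the two visual streams are paired are not specified here, so the CLIP-style symmetric cross-entropy and the three pairwise terms are illustrative choices, not the paper's exact method.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss between paired embeddings.

    a, b: (batch, dim) arrays where row i of `a` matches row i of `b`.
    Returns the average of the a->b and b->a cross-entropy losses.
    """
    # L2-normalize so dot products become cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    targets = np.arange(len(a))               # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(l)), targets].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy pretraining step: frame-level, segment-level, and text embeddings for
# a batch of 4 video-sentence pairs (dimensions are arbitrary for illustration).
rng = np.random.default_rng(0)
frame_emb, segment_emb, text_emb = (rng.normal(size=(4, 8)) for _ in range(3))

# Jointly align both visual streams with the text, and with each other
# (the combination of pairwise terms is an assumption of this sketch).
loss = (info_nce(frame_emb, text_emb)
        + info_nce(segment_emb, text_emb)
        + info_nce(frame_emb, segment_emb))
```

As a sanity check, embeddings whose rows are correctly paired yield a lower loss than the same embeddings with one side shuffled, which is what drives the cross-modal alignment during pretraining.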
📝 Abstract
Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual-visual-encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. For the downstream SLT task, we fuse the visual features and feed them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual-encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
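The downstream stage fuses the two visual streams before translation. The following NumPy sketch shows one plausible fusion scheme, concatenation followed by a learned linear projection; the fusion operator, feature dimensions, and sequence length here are assumptions for illustration, as the abstract does not specify them, and the Transformer encoder-decoder that would consume the fused sequence is deliberately not implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_frame, d_segment, d_model = 16, 512, 768, 256  # assumed sizes

# Outputs of the two visual backbones for one video, assumed to be
# aligned to a common temporal resolution of T steps.
frame_feats = rng.normal(size=(T, d_frame))      # frame-level stream
segment_feats = rng.normal(size=(T, d_segment))  # segment-level stream

# Fuse by concatenating along the feature axis, then project down to the
# translation model's width with a (hypothetical) learned matrix W.
W = rng.normal(size=(d_frame + d_segment, d_model)) * 0.02
fused = np.concatenate([frame_feats, segment_feats], axis=1) @ W  # (T, d_model)

# `fused` would serve as the input sequence to the encoder-decoder that
# generates the spoken-language sentence token by token.
```

Concatenation-plus-projection is a common default for combining heterogeneous feature streams; alternatives such as gated or attention-based fusion would slot into the same position in the pipeline.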