🤖 AI Summary
Sign language translation is a severely low-resource task because large-scale, sentence-level aligned data is scarce. To address this, we propose a linguistics-inspired pose concatenation pretraining framework that enables end-to-end translation without requiring gloss-level annotations. First, synthetic sentence pairs are generated from linguistic templates to provide a strong supervised signal. Second, a pose sequence concatenation strategy is introduced to explicitly model temporal dependencies across gestures during pretraining. The method uses a Transformer encoder-decoder architecture trained jointly on template-driven synthetic supervision and concatenation-augmented data. Evaluated on How2Sign and iSign, the approach achieves BLEU-4 scores of 4.56 (+2.59) and 3.43 (+2.88), respectively, substantially outperforming prior state-of-the-art methods. This work establishes a scalable, gloss-free paradigm for low-resource sign language translation.
📝 Abstract
Sign language translation remains a challenging task due to the scarcity of large-scale, sentence-aligned datasets. Prior work has focused on feature extraction and architectural changes to support neural machine translation for sign languages. We propose POSESTITCH-SLT, a novel pre-training scheme inspired by linguistic template-based sentence generation. In translation experiments on two sign language datasets, How2Sign and iSign, we show that a simple transformer-based encoder-decoder architecture outperforms prior work when template-generated sentence pairs are included in training. BLEU-4 improves from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, surpassing prior state-of-the-art methods for pose-based gloss-free translation. The results demonstrate the effectiveness of template-driven synthetic supervision in low-resource sign language settings.
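To make the idea of template-generated sentence pairs more concrete, the following is a minimal Python sketch of how a synthetic (pose sequence, sentence) pair might be assembled by concatenating word-level pose clips according to a template. It is an illustration under stated assumptions, not the authors' implementation: the template, the slot vocabulary, the `WORD_CLIPS` dictionary, the keypoint count, and the helper names are all hypothetical, and the placeholder clips are random numbers standing in for real pose features.

```python
import random

NUM_KEYPOINTS = 4  # assumption: tiny value for illustration only


def make_placeholder_clip(num_frames, seed):
    """Return a fake pose clip: a list of frames, each a list of (x, y) keypoints."""
    r = random.Random(seed)
    return [
        [(r.random(), r.random()) for _ in range(NUM_KEYPOINTS)]
        for _ in range(num_frames)
    ]


# Hypothetical word-level pose dictionary (word -> pose clip).
WORD_CLIPS = {
    "i": make_placeholder_clip(5, 1),
    "you": make_placeholder_clip(6, 2),
    "like": make_placeholder_clip(7, 3),
    "see": make_placeholder_clip(4, 4),
    "book": make_placeholder_clip(8, 5),
    "house": make_placeholder_clip(9, 6),
}

# Hypothetical linguistic template and the words allowed in each slot.
TEMPLATE = ("subject", "verb", "object")
SLOTS = {
    "subject": ["i", "you"],
    "verb": ["like", "see"],
    "object": ["book", "house"],
}


def stitch_example(rng):
    """Fill the template with sampled words and concatenate their pose clips in order."""
    words = [rng.choice(SLOTS[slot]) for slot in TEMPLATE]
    stitched = []
    for word in words:
        stitched.extend(WORD_CLIPS[word])  # concatenate along the time axis
    return stitched, " ".join(words)


poses, sentence = stitch_example(random.Random(0))
print(sentence, len(poses))
```

Each call yields one synthetic training pair: the stitched frame sequence would serve as encoder input and the template sentence as the decoder target. Whether the actual pre-training adds smoothing between clips or other augmentation is not stated in the abstract.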