🤖 AI Summary
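This work addresses sign language translation (SLT), where high-quality parallel video-text data for fine-tuning is scarce and models generalize poorly to long-tail vocabulary and unseen constructions. The authors propose a corpus augmentation method that requires no additional human annotation, external sign-language video, or generative video models; it relies only on the existing gloss-annotated training corpus and a large language model (LLM) for sentence generation. Per-gloss clips are extracted from training videos via CTC forced alignment, a corpus-anchored LLM generates novel gloss-sentence pairs, and clips are randomly sampled and concatenated to synthesize new RGB video-text pairs. These pairs can be consumed directly by RGB-based SLT models or converted into pose or feature representations. Applied within the GFSLT-VLP framework with no change to architecture or training protocol, the augmentation yields a +2.92 BLEU-4 gain over the baseline; for comparison, the largest verified gain among five recent gloss-free methods re-evaluated by Sincan et al. under identical conditions was 0.98 BLEU-4. The authors also report that synthetic data harms vision-language pretraining despite improving its objectives, and that optimizing clip transitions for smoothness under L2-based criteria is counter-productive; they hypothesize that abrupt boundaries may act as a form of implicit regularization.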
📝 Abstract
Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at https://github.com/robizso/slt-datagen.
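The assembly step described in the abstract (a per-gloss clip bank, random sentence sampling, and random clip assignment with hard cuts between clips) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository; the function names `build_clip_bank`, `assemble`, and `synthesize`, the tuple layout of the alignment records, and the example glosses are all assumptions made for illustration. The CTC forced-alignment and LLM generation steps are assumed to have run already and are represented only by their outputs.

```python
import random
from collections import defaultdict


def build_clip_bank(alignments):
    """Group aligned segments by gloss.

    `alignments` is an iterable of (video_id, gloss, start_frame, end_frame)
    tuples. This layout is hypothetical; it stands in for whatever the
    forced-alignment step actually emits.
    """
    bank = defaultdict(list)
    for video_id, gloss, start, end in alignments:
        if end > start:  # drop empty segments
            bank[gloss].append((video_id, start, end))
    return bank


def assemble(gloss_seq, bank, rng):
    """Pick one random clip per gloss, in sentence order.

    Returns None when any gloss has no clip in the bank. Clips are simply
    listed back to back, i.e. hard cuts with no transition smoothing.
    """
    if any(g not in bank for g in gloss_seq):
        return None
    return [rng.choice(bank[g]) for g in gloss_seq]


def synthesize(pairs, bank, n, seed=0):
    """Sample `n` synthetic (clip list, sentence) examples.

    `pairs` is a list of (gloss_sequence, sentence) tuples, assumed to come
    from the LLM generation step. Pairs containing a gloss that is missing
    from the bank are skipped.
    """
    rng = random.Random(seed)
    usable = [(g, s) for g, s in pairs if all(x in bank for x in g)]
    samples = []
    if not usable:
        return samples
    for _ in range(n):
        glosses, sentence = rng.choice(usable)
        samples.append({"clips": assemble(glosses, bank, rng), "text": sentence})
    return samples


if __name__ == "__main__":
    # Made-up alignment records and gloss-sentence pairs for demonstration.
    alignments = [
        ("v1", "TOMORROW", 0, 12),
        ("v1", "RAIN", 12, 30),
        ("v2", "RAIN", 5, 20),
    ]
    pairs = [(["TOMORROW", "RAIN"], "It will rain tomorrow.")]
    bank = build_clip_bank(alignments)
    for sample in synthesize(pairs, bank, n=3, seed=1):
        print(sample)
```

Each returned sample lists the source video and frame range for every gloss, so a downstream loader could read those frame ranges and concatenate them into one RGB sequence paired with the sentence. Keeping the output as frame references rather than decoded video is a design choice of this sketch only.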