🤖 AI Summary
Existing annotation-free sign language translation methods achieve strong performance but suffer from high model complexity and computational overhead, limiting scalability to large-scale datasets. To address this, we propose a segment-aware visual tokenization framework: first, continuous sign videos are decomposed into semantic units via gesture segmentation; then, discrete visual tokens are generated, substantially compressing input sequence length. We further introduce a vocabulary-free, two-level contrastive alignment mechanism—operating at both token-to-token and hidden-state-to-language-embedding levels—to enable fine-grained cross-modal modeling. Evaluated on PHOENIX14T, our method surpasses state-of-the-art approaches: it reduces input sequence length by up to 50%, cuts memory consumption by up to 2.67×, and achieves superior translation quality at comparable sequence lengths. The approach thus jointly advances efficiency, scalability, and accuracy in annotation-free sign language translation.
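The compression step can be illustrated with a minimal sketch: given per-frame features and segment boundaries from a sign-segmentation model, mean-pool each segment into a single token. This is a simplified stand-in for the paper's tokenizer (the actual segmentation model and discretization are not shown here; the function name and boundary format are assumptions for illustration).

```python
def segment_pool(frame_features, boundaries):
    """Compress a continuous per-frame feature sequence into one token per
    sign segment by mean-pooling the frames inside each segment.

    Illustrative sketch only: the paper's tokenizer additionally discretizes
    the pooled features, which is omitted here.

    frame_features: list of feature vectors (one per video frame)
    boundaries: list of (start, end) frame-index pairs, end exclusive
    """
    tokens = []
    for start, end in boundaries:
        segment = frame_features[start:end]
        dim = len(segment[0])
        # Mean over the frames of this segment, per feature dimension.
        pooled = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
        tokens.append(pooled)
    return tokens


# Usage: 8 frames collapsed into 2 segment tokens (4x shorter sequence).
frames = [[float(i)] for i in range(8)]
tokens = segment_pool(frames, [(0, 4), (4, 8)])
```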
📝 Abstract
Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performance without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence lengths, validating the potential of our tokenization and alignment strategies.
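The token-to-token contrastive objective can be sketched as an InfoNCE-style loss: each visual token is pulled toward its paired language token and pushed away from the others in the sequence. The cosine similarity, temperature value, and one-to-one pairing below are assumptions for illustration, not the paper's exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def token_contrastive_loss(visual_tokens, text_tokens, temperature=0.1):
    """InfoNCE-style token-to-token alignment (illustrative sketch).

    Assumes visual_tokens[i] is paired with text_tokens[i]; all other text
    tokens in the sequence act as negatives. The paper's actual pairing and
    dual-level (hidden-state) supervision are not reproduced here.
    """
    loss = 0.0
    for i, v in enumerate(visual_tokens):
        logits = [cosine(v, t) / temperature for t in text_tokens]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair at index i.
        loss += -(logits[i] - log_z)
    return loss / len(visual_tokens)
```

With correctly paired tokens the loss is near zero; shuffling the pairing increases it, which is the signal that drives the alignment.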