🤖 AI Summary
Existing sign language research predominantly focuses on "sign language → text" translation, while multilingual autoregressive "text → 3D sign language" generation remains underexplored. This paper introduces the first multilingual text-to-3D sign motion generation framework. Methodologically, it features: (1) a disentangled sign tokenization scheme that separately encodes hand gestures, body poses, and facial expressions; (2) a multi-head parallel autoregressive decoder for efficient temporal modeling; and (3) a retrieval-augmented, word-level sign conditioning mechanism that integrates priors from external sign lexicons. The framework employs a pretrained language model as the text encoder and jointly optimizes generation fidelity and linguistic comprehensibility. Evaluated on multilingual sign language datasets, it achieves state-of-the-art performance; both quantitative metrics and qualitative analysis confirm the accuracy, fluency, and cross-lingual generalization of the generated 3D sign motions.
📝 Abstract
Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task, sign language generation (SLG, text-to-sign), remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which generates 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing different body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method that predicts multiple tokens simultaneously, improving inference efficiency while maintaining effective information fusion across body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach that incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of the generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. Code, models, and data will be made publicly available.
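The multi-head decoding idea can be illustrated with a toy sketch: instead of flattening hand, body, and face tokens into one interleaved sequence, each decoding step fuses the previous tokens of all parts into a shared state, and one head per part emits the next token from that part's codebook in parallel. The part names, codebook sizes, and the tanh "backbone" below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical part-wise codebooks; the paper's tokenizer and sizes differ.
PARTS = ["hands", "body", "face"]
VOCAB = {"hands": 8, "body": 6, "face": 4}  # per-part codebook size
HIDDEN = 16

# Toy stand-in for the LM backbone: random embedding and head weights.
W_embed = {p: rng.normal(size=(VOCAB[p], HIDDEN)) for p in PARTS}
W_head = {p: rng.normal(size=(HIDDEN, VOCAB[p])) for p in PARTS}

def decode_step(prev_tokens, text_ctx):
    """Predict one token per body part simultaneously (multi-head decoding)."""
    # Fuse all part embeddings with the text context into one shared state,
    # so information still flows across body parts.
    h = np.tanh(text_ctx + sum(W_embed[p][prev_tokens[p]] for p in PARTS))
    # Each head reads the same fused state but predicts from its own codebook.
    return {p: int(np.argmax(h @ W_head[p])) for p in PARTS}

def generate(text_ctx, steps=5):
    tokens = {p: 0 for p in PARTS}  # start token per part
    seq = []
    for _ in range(steps):
        tokens = decode_step(tokens, text_ctx)
        seq.append(tokens)
    return seq

motion = generate(rng.normal(size=HIDDEN))
print(len(motion), motion[0])
```

One decoding step here emits three tokens at once, which is the source of the claimed inference speedup over flattened single-token prediction.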
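The retrieval-enhanced conditioning can likewise be sketched in miniature: words in the input text are looked up in an external sign dictionary, and any matching word-level token sequences are passed to the generator as auxiliary conditions. The dictionary contents and token values below are made up for illustration.

```python
# Hypothetical external sign dictionary mapping words to precomputed
# word-level sign token sequences (acting as retrieval priors).
SIGN_DICT = {
    "hello": [3, 7, 7, 2],
    "world": [5, 1, 4],
}

def retrieve_conditions(text):
    """Collect word-level sign token sequences for dictionary words in the
    input; these act as auxiliary conditions for the generator."""
    conds = []
    for word in text.lower().split():
        if word in SIGN_DICT:
            conds.append((word, SIGN_DICT[word]))
    return conds

conds = retrieve_conditions("Hello there world")
print(conds)  # "there" has no dictionary entry, so only two priors are found
```

The generator would consume these retrieved sequences alongside the text encoding, anchoring each in-vocabulary word to an accurate reference sign.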