AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current continuous sign language recognition (CSLR) approaches rely on multi-stage pipelines, leading to error propagation, overfitting, and poor scalability to vocabulary expansion. To address these limitations, this paper proposes an end-to-end autoregressive translation framework that directly maps sign language pose sequences to natural language text, eliminating intermediate gloss representations and explicit alignment mechanisms. Methodologically, we employ a 1D convolutional neural network to compress temporal pose features and adapt the pre-trained Arabic language model AraGPT2 as the decoder for cross-modal generation. Evaluated on the Isharah-1000 dataset, our approach achieves substantial improvements over state-of-the-art methods, reducing word error rate (WER) by up to 6.1 percentage points. These results validate the effectiveness and generalization advantage of the end-to-end paradigm, as well as the efficacy of joint body-and-hand pose representation for CSLR.

📝 Abstract
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This requires recognizing and interpreting the hand, face, and body gestures of the signer, which poses a challenge because it involves a combination of all these features. Existing Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC- or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages, are prone to overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. This decoder-only approach allows the model to map directly from pose features to glosses without a CTC loss, while also learning the textual dependencies among the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, reducing WER by up to 6.1% compared to the best existing method.
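The temporal compression step described above can be illustrated with a strided 1D convolution over a pose-feature sequence. This is a minimal NumPy sketch, not the paper's implementation: the kernel size, stride, feature dimensions, and weight values below are illustrative assumptions, and the actual model would use a learned, multi-layer 1D CNN.

```python
import numpy as np

def temporal_compress(poses, weights, bias, stride=2):
    """Strided 1D convolution over time: (T, D_in) -> (T_out, D_out).

    poses:   (T, D_in) per-frame pose features
    weights: (kernel, D_in, D_out) convolution kernel (illustrative values)
    """
    k, d_in, d_out = weights.shape
    t_out = (poses.shape[0] - k) // stride + 1  # valid-convolution length
    out = np.empty((t_out, d_out))
    for i in range(t_out):
        window = poses[i * stride : i * stride + k]  # (k, D_in) time window
        # Contract the time and input-feature axes against the kernel.
        out[i] = np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias
    return out

# Hypothetical example: 100 frames of 86-dim pose features,
# compressed to a shorter 256-dim sequence for the decoder.
rng = np.random.default_rng(0)
poses = rng.normal(size=(100, 86))
w = rng.normal(size=(5, 86, 256))
b = np.zeros(256)
compressed = temporal_compress(poses, w, b, stride=2)
print(compressed.shape)  # (48, 256)
```

The stride-2 convolution halves the temporal length, which is the point of the compression module: the decoder then attends over a shorter sequence than the raw frame rate would require.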
Problem

Research questions and friction points this paper is trying to address.

Recognizing continuous sign gestures to bridge communication gaps
Overcoming error propagation in multi-stage sign recognition pipelines
Directly translating pose sequences to text without alignment bottlenecks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive decoder-only transformer for direct translation
Temporal compression with 1D CNNs for pose sequences
Pre-trained Arabic decoder (AraGPT2) for gloss generation
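The autoregressive generation the bullets describe can be sketched as a simple greedy decoding loop: each token is chosen conditioned on the compressed pose features and the tokens emitted so far. The scoring function below is a toy stand-in for the AraGPT2 decoder, and the token IDs are hypothetical.

```python
def generate(score_next, pose_features, bos=0, eos=99, max_len=10):
    """Greedy autoregressive decoding: emit one token per step,
    conditioned on pose features and the prefix, until EOS."""
    tokens = [bos]
    for _ in range(max_len):
        nxt = score_next(pose_features, tokens)  # pick the next token
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

# Toy scorer standing in for the decoder: counts up for three steps,
# then emits EOS. Real scoring would come from the language model.
def toy_scorer(features, tokens):
    return tokens[-1] + 1 if len(tokens) < 4 else 99

print(generate(toy_scorer, None))  # [0, 1, 2, 3, 99]
```

Because the decoder conditions on its own previous outputs, it learns gloss-to-gloss dependencies directly, which is what removes the need for an explicit CTC alignment step.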
Samuel Ebimobowei Johnny
Carnegie Mellon University Africa, Kigali, Rwanda
Blessed Guda
Carnegie Mellon University Africa, Kigali, Rwanda
Andrew Blayama Stephen
Carnegie Mellon University Africa, Kigali, Rwanda
Assane Gueye
Associate Teaching Professor
Carnegie Mellon University Africa