Exploring Pose-based Sign Language Translation: Ablation Studies and Attention Insights

📅 2025-07-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically investigates the impact of pose data preprocessing—specifically standardization, temporal interpolation, and geometric augmentation—on continuous sign language translation (SLT) performance. Building upon an enhanced T5 encoder-decoder architecture, we conduct ablation experiments on YouTubeASL and How2Sign, quantitatively evaluating how each preprocessing strategy improves model robustness and generalization. Our key contributions are threefold: (1) a lightweight pose normalization and interpolation method explicitly designed for the spatiotemporal characteristics of sign language; (2) the introduction of learnable register tokens to significantly strengthen cross-frame pose modeling; and (3) attention visualization analysis that reveals the model’s focus on critical joints and semantic units, enhancing interpretability. Experimental results demonstrate that optimized preprocessing yields a +2.7 BLEU-4 gain. All code and preprocessed datasets are publicly released.

Technology Category

Application Category

📝 Abstract
Sign Language Translation (SLT) has evolved significantly, moving from isolated recognition approaches to complex, continuous gloss-free translation systems. This paper explores the impact of pose-based data preprocessing techniques - normalization, interpolation, and augmentation - on SLT performance. We employ a transformer-based architecture, adapting a modified T5 encoder-decoder model to process pose representations. Through extensive ablation studies on YouTubeASL and How2Sign datasets, we analyze how different preprocessing strategies affect translation accuracy. Our results demonstrate that appropriate normalization, interpolation, and augmentation techniques can significantly improve model robustness and generalization abilities. Additionally, we provide a deep analysis of the model's attentions and reveal interesting behavior suggesting that adding a dedicated register token can improve overall model performance. We publish our code on our GitHub repository, including the preprocessed YouTubeASL data.
Problem

Research questions and friction points this paper is trying to address.

Impact of pose-based preprocessing on sign language translation
Effect of normalization, interpolation, augmentation on SLT performance
Analysis of attention mechanisms in transformer-based SLT models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-based data preprocessing techniques
Modified T5 encoder-decoder model
Dedicated register token addition
🔎 Similar Papers
No similar papers found.
T
Tomas Zelezny
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics
Jakub Straka
Jakub Straka
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics
V
Vaclav Javorek
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics
O
Ondrej Valach
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics
Marek Hruz
Marek Hruz
University of West Bohemia
artificial intelligenceimage processingmachine learning
I
Ivan Gruber
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics