A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production

📅 2026-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing sign language generation methods, which often rely on deterministic embeddings that lead to representation collapse and struggle to disentangle articulatory motion hierarchies or produce realistic gestures. To overcome these challenges, the authors propose an alignment-aware variational framework that, for the first time, integrates a variational autoencoder (VAE) to model distributed latent variables and learn disentangled representations of joint-level motion hierarchies. A non-autoregressive Transformer is trained under distributional supervision and enhanced with a gloss attention mechanism to strengthen text-to-motion alignment. Notably, the method generates high-quality sign language motions without requiring gloss annotations at inference and achieves state-of-the-art performance on back-translation, significantly improving both the realism of generated motions and their semantic consistency with the input text.

📝 Abstract
Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which are used as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, while the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. This formulation maintains articulator-level representations by avoiding deterministic latent collapse through distributional latent modeling. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
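The abstract describes a Transformer that predicts per-articulator latent means and log-variances, supervised toward the Gaussians produced by a disentangled VAE encoder, with poses decoded from reparameterized samples. A minimal NumPy sketch of these two ingredients follows; the shapes, the articulator split, and the use of a KL term as the "distributional supervision" loss are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    """Stochastic sampling z = mu + sigma * eps (reparameterization trick),
    as used at the decoding stage before the VAE decoder."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_target(mu_pred, logvar_pred, mu_tgt, logvar_tgt):
    """KL( N(pred) || N(target) ), summed over latent dims.

    Stand-in for distributional supervision: the Transformer's predicted
    Gaussians are pulled toward the VAE encoder's articulator-specific
    Gaussians instead of regressing a single deterministic embedding.
    """
    var_pred = np.exp(logvar_pred)
    var_tgt = np.exp(logvar_tgt)
    kl = 0.5 * (logvar_tgt - logvar_pred
                + (var_pred + (mu_pred - mu_tgt) ** 2) / var_tgt
                - 1.0)
    return kl.sum(axis=-1)

# Hypothetical layout: 3 articulators (e.g. body, left hand, right hand),
# each with its own 16-dim Gaussian latent.
mu_tgt = rng.standard_normal((3, 16))
logvar_tgt = np.full((3, 16), -1.0)

# A prediction matching the encoder's distribution incurs zero KL.
assert np.allclose(kl_to_target(mu_tgt, logvar_tgt, mu_tgt, logvar_tgt), 0.0)

# Sampling keeps one latent per articulator, preserving the disentangled split.
z = reparameterize(mu_tgt, logvar_tgt, rng)
print(z.shape)
```

Modeling each articulator as its own Gaussian, rather than one deterministic vector, is what lets the sampled latents vary independently per articulator at decode time.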
Problem

Research questions and friction points this paper is trying to address.

Sign Language Production
Disentangled Representation
Latent Modeling
Motion Realism
Text-to-Sign Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation
variational modeling
sign language production
non-autoregressive transformer
alignment-aware
Sümeyye Meryem Taşyürek
Hacettepe University, Computer Engineering Department, Ankara, Türkiye
Enis Mücahid İskender
Hacettepe University, Computer Engineering Department, Ankara, Türkiye
Hacer Yalim Keles
Hacettepe University, Computer Engineering Department, Ankara, Türkiye
computer vision · machine learning · generative adversarial networks