🤖 AI Summary
This work addresses limitations of existing sign language production methods, whose deterministic latent embeddings are prone to representation collapse, struggle to disentangle the motion of individual articulators, and tend to produce less realistic signing. The authors propose A$^{2}$V-SLP, an alignment-aware variational framework in which a disentangled variational autoencoder (VAE) models articulator-wise latent distributions rather than point embeddings. A non-autoregressive Transformer is trained under distributional supervision from the VAE and augmented with a gloss attention mechanism to strengthen text-to-motion alignment. The method generates sign pose sequences without gloss input and reports state-of-the-art back-translation performance, with consistent gains over deterministic latent regression and improved motion realism.
📝 Abstract
Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which serve as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, and the VAE decoder reconstructs the final sign pose sequences from latents sampled stochastically at the decoding stage. By modeling latents as distributions, this formulation avoids the collapse seen with deterministic latents and preserves articulator-level representations. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
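
To make the sampling step in the abstract concrete, here is a minimal sketch in plain Python of drawing one latent per articulator from a predicted mean and log-variance using the standard reparameterisation trick. The articulator names, the latent size, the numeric values, and the helper names `sample_latent` and `diag_gaussian_kl` are all hypothetical. The abstract does not specify the form of the distributional supervision loss, so the KL divergence between diagonal Gaussians used here is an assumption, not the authors' objective.

```python
import math
import random

# Hypothetical articulator split; the abstract does not list the exact groups.
ARTICULATORS = ("body", "left_hand", "right_hand", "face")


def sample_latent(mu, logvar, rng):
    """Reparameterised draw z = mu + exp(0.5 * logvar) * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def diag_gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(N_p || N_q) between diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


rng = random.Random(0)
dim = 4  # illustrative latent size per articulator

# Stand-ins for the Transformer's predictions and the VAE encoder's targets,
# each stored as a (mean, log-variance) pair per articulator.
pred = {a: ([0.1 * i for i in range(dim)], [-1.0] * dim) for a in ARTICULATORS}
target = {a: ([0.1 * i + 0.05 for i in range(dim)], [-1.2] * dim) for a in ARTICULATORS}

# One latent sample per articulator; a VAE decoder would map these back to poses.
z = {a: sample_latent(mu, logvar, rng) for a, (mu, logvar) in pred.items()}

# Assumed distributional loss: KL from the target Gaussian to the predicted one,
# summed over articulators.
loss = sum(diag_gaussian_kl(*target[a], *pred[a]) for a in ARTICULATORS)
```

Predicting the log-variance rather than the variance keeps the standard deviation positive without any explicit constraint, which is presumably why the Transformer outputs log-variances.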