🤖 AI Summary
Current sign language generation methods prioritize semantic accuracy while neglecting emotional expressiveness, resulting in synthesized videos with limited naturalness and vitality. To address this, we propose a multi-emotion-guided semantic disentanglement framework that explicitly separates emotion-related and semantics-related features via a dedicated disentanglement module and employs a progressive training strategy for fine-grained emotion integration. During pose generation, we model confidence scores across seven distinct emotion categories and leverage a diffusion-based decoder to synthesize high-fidelity sign language pose sequences. To our knowledge, this is the first work to jointly incorporate explicit emotion control and semantic disentanglement into a text-to-sign generation pipeline. Evaluated on mainstream benchmarks, our method achieves significantly higher pose accuracy than all baselines while producing videos that simultaneously preserve semantic correctness and rich emotional expressiveness, thereby enhancing natural communication for Deaf users.
📝 Abstract
Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expression, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules, trained progressively, to extract semantic and affective features separately. During pose decoding, the emotional representations guide the interaction with semantic features to generate sign poses annotated with 7-class emotion confidence scores, enabling recognition of the expressed emotion. Experimental results demonstrate that, by integrating multi-emotion information, EASL achieves pose accuracy superior to all compared baselines and adapts effectively to diffusion models to generate expressive sign language videos.