🤖 AI Summary
Existing sign language generation methods struggle to balance fluency and accuracy: end-to-end models suffer from regression-to-the-mean effects, while dictionary-based retrieval often yields unnatural motions. This work proposes a novel sparse keyframe-driven paradigm that leverages the FAST model to automatically extract temporally precise keyframes and introduces a Keyframe-to-Pose (KF2P) generation mechanism. Integrated within a reconstruction-guided Conditional Flow Matching (CFM) framework, the approach synthesizes high-fidelity, coherent 3D sign language sequences from sparse discrete anchors in fewer than ten sampling steps. The method supports precise spatiotemporal editing and, for the first time, unifies modeling across Chinese, English, German, and French, achieving state-of-the-art performance on multilingual benchmarks as the largest multilingual sign language generation system to date.
📝 Abstract
Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation uniquely unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our reconstruction-based CFM objective enables high-fidelity synthesis in fewer than ten sampling steps, allowing SignSparK to scale across four distinct sign languages and establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state-of-the-art across diverse SLP tasks and multilingual benchmarks.
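To make the few-step CFM sampling claim concrete, here is a minimal toy sketch of flow-matching inference: a sample is drawn from Gaussian noise and carried to the target by Euler-integrating a velocity field over a handful of steps. This is not the SignSparK implementation; the closed-form straight-line velocity field stands in for the learned network v(x, t, cond), and all names (`toy_velocity`, `sample_cfm`) are illustrative.

```python
# Hedged sketch of few-step Conditional Flow Matching (CFM) sampling.
# Assumption: the learned velocity field is replaced by the closed-form
# optimal-transport field v = (cond - x) / (1 - t), which is exactly straight.
import numpy as np

def toy_velocity(x, t, cond):
    # Stand-in for a neural velocity network v_theta(x, t, cond).
    return (cond - x) / (1.0 - t)

def sample_cfm(velocity, cond, num_steps=8, seed=0):
    # Euler integration of dx/dt = v(x, t, cond) from t=0 (Gaussian noise)
    # toward t=1 (data). Few steps suffice when the field is near-straight,
    # the property a reconstruction-guided objective encourages.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond.shape)  # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t, cond)
    return x

# Stand-in conditioning target (e.g. a sparse keyframe pose vector).
target = np.array([0.5, -1.2, 2.0])
pose = sample_cfm(toy_velocity, target, num_steps=8)
print(np.allclose(pose, target))  # True: Euler is exact on a straight field
```

With a straight-line flow, eight Euler steps land exactly on the target; a learned field is only approximately straight, which is why a small (but greater than one) step count is still needed in practice.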