SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address semantic inconsistency and poor interpretability in speech-driven gesture generation, this paper proposes a novel framework integrating gesture behavior graphs with intent-chain reasoning. Methodologically, we introduce the first LLM-driven, structured intent-chain reasoning paradigm, which decomposes speech intents into multi-step semantic units and maps them onto graph-supported gesture labels. We further construct a lightweight intent-chain annotation dataset and a dedicated label generation model to achieve precise text-to-gesture semantic alignment. Experimental results demonstrate a gesture semantic alignment accuracy of 50.2% and an average inference latency of only 0.4 seconds per sample. The framework maintains high-fidelity co-synthesis of speech and gestures while substantially enhancing output credibility and interpretability—enabling transparent, stepwise intent-to-gesture mapping grounded in both linguistic semantics and gestural ontology.


📝 Abstract
Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
Problem

Research questions and friction points this paper is trying to address.

Generating semantically meaningful co-speech gestures from speech
Parsing speech content to produce reliable gesture labels
Ensuring gesture synthesis aligns with contextual intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based intent chain reasoning mechanism
Lightweight gesture label generation model
Interpretable intent reasoning pathway
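
The intent-chain idea above can be sketched as a small pipeline: decompose speech into per-utterance semantic intents, then map each intent onto an ethogram-supported gesture label. The sketch below is purely illustrative, assuming a toy ethogram and a rule-based stub in place of the paper's LLM reasoning and trained label model; all names (`ETHOGRAM`, `IntentStep`, `reason_intent_chain`) are hypothetical.

```python
# Hypothetical sketch of intent-chain gesture labeling in the spirit of
# SARGes; NOT the authors' implementation. A real system would prompt an
# LLM with structured inference steps, while this stub keys off surface
# cues purely for illustration.
from dataclasses import dataclass
from typing import List

# Toy stand-in "ethogram": gesture labels keyed by semantic intent.
ETHOGRAM = {
    "greeting": "wave",
    "enumeration": "count_on_fingers",
    "negation": "head_shake",
    "emphasis": "beat",
}

@dataclass
class IntentStep:
    utterance: str
    intent: str          # inferred semantic unit
    gesture_label: str   # ethogram-supported gesture label

def reason_intent_chain(utterances: List[str]) -> List[IntentStep]:
    """Decompose speech into per-utterance intents and map each intent
    onto a gesture label from the ethogram."""
    steps = []
    for u in utterances:
        text = u.lower()
        if text.startswith(("hello", "hi")):
            intent = "greeting"
        elif "not" in text or "never" in text:
            intent = "negation"
        elif any(w in text for w in ("first", "second", "third")):
            intent = "enumeration"
        else:
            intent = "emphasis"
        steps.append(IntentStep(u, intent, ETHOGRAM[intent]))
    return steps

chain = reason_intent_chain([
    "Hello everyone.",
    "First, we parse the speech.",
    "This is not a black box.",
])
for s in chain:
    print(f"{s.utterance!r} -> {s.intent} -> {s.gesture_label}")
```

Keeping each step as an explicit (utterance, intent, label) record is what makes the mapping interpretable: every emitted gesture label can be traced back to a specific inference step rather than an opaque end-to-end prediction.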
Nan Gao
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yihua Bao
Beijing Institute of Technology
Virtual Reality, Digital Human, Human-Computer Interaction
Dongdong Weng
Beijing Engineering Research Center of Mixed Reality and Advanced Display, Beijing, China, and also with the Institute of Technology, Beijing, China
Jiayi Zhao
Beijing Engineering Research Center of Mixed Reality and Advanced Display, Beijing, China, and also with the Institute of Technology, Beijing, China
Jia Li
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yan Zhou
Kuaishou Technology, Beijing, China
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative Models, Computer Vision, Multimodal AI, Computer Graphics
Di Zhang
Kuaishou Technology, Beijing, China