🤖 AI Summary
To address semantic inconsistency and poor interpretability in speech-driven gesture generation, this paper proposes a novel framework integrating a gesture behavior graph with intent-chain reasoning. Methodologically, we introduce the first LLM-driven, structured intent-chain reasoning paradigm, which decomposes speech intent into multi-step semantic units and maps them onto graph-supported gesture labels. We further construct a lightweight intent-chain annotation dataset and a dedicated label generation model to achieve precise text-to-gesture semantic alignment. Experimental results demonstrate a gesture semantic alignment accuracy of 50.2% and an average inference latency of only 0.4 seconds per sample. The framework maintains high-fidelity, speech-synchronized gesture synthesis while substantially enhancing output credibility and interpretability, enabling transparent, stepwise intent-to-gesture mapping grounded in both linguistic semantics and gestural ontology.
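To make the decompose-then-map idea concrete, here is a minimal Python sketch of one plausible reading of intent-chain reasoning: split an utterance into semantic units, infer each unit's intent with an external classifier (standing in for an LLM call), and resolve a gesture label from an ethogram. All names here (`ETHOGRAM`, `IntentStep`, the stub classifier) are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch of intent-chain reasoning: utterance -> semantic units ->
# intents -> graph-supported gesture labels. The ethogram entries and the
# classifier are toy stand-ins, not the paper's released artifacts.
from dataclasses import dataclass

# Toy ethogram: gesture labels keyed by the semantic function they serve.
ETHOGRAM = {
    "enumeration": "count_on_fingers",
    "negation": "head_shake_palm_down",
    "emphasis": "beat_downward",
    "spatial_reference": "point_deictic",
}

@dataclass
class IntentStep:
    span: str    # the speech fragment this step covers
    intent: str  # inferred semantic function (an ethogram key)
    label: str   # gesture label resolved from the ethogram

def reason_intent_chain(utterance: str, classify) -> list[IntentStep]:
    """Decompose an utterance into clauses, infer each clause's intent
    via `classify` (e.g. an LLM call), and resolve a gesture label for
    every step from the ethogram."""
    chain = []
    for clause in utterance.split(","):
        clause = clause.strip()
        intent = classify(clause)                   # e.g. "emphasis"
        label = ETHOGRAM.get(intent, "neutral_rest")
        chain.append(IntentStep(clause, intent, label))
    return chain

# Usage with a stub classifier standing in for the LLM:
steps = reason_intent_chain(
    "First we tried it, then we absolutely scrapped it",
    classify=lambda s: "enumeration" if "first" in s.lower() else "emphasis",
)
for s in steps:
    print(f"{s.span!r} -> {s.intent} -> {s.label}")
```

The stepwise chain is what makes the mapping inspectable: each gesture label can be traced back to the clause and intent that produced it.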
📝 Abstract
Co-speech gesture generation enhances the realism of human-computer interaction through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we construct a comprehensive co-speech gesture ethogram and develop an LLM-based intent-chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we construct an intent-chain-annotated text-to-gesture-label dataset and train a lightweight gesture label generation model, which in turn guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds per sample). The proposed method provides an interpretable intent-reasoning pathway for semantic gesture synthesis.
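The abstract's two-stage recipe (an LLM annotates intent chains offline, then a lightweight model learns text-to-label mapping for fast single-pass inference) amounts to a distillation setup. Below is a hedged sketch of what such a setup could look like; the record fields and the scikit-learn pipeline are illustrative assumptions, not the paper's released dataset format or model.

```python
# Hedged sketch of the distillation step: LLM-produced intent-chain
# annotations become (text, gesture label) training pairs for a compact
# classifier, so deployment needs no LLM call per utterance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One hypothetical record from an intent-chain-annotated dataset.
record = {
    "text": "and that, honestly, is the third reason",
    "intent_chain": ["marker:honestly", "function:enumeration", "ordinal:3"],
    "gesture_label": "count_on_fingers",
}

# A few toy training pairs (text -> gesture label); the real dataset
# would hold many LLM-annotated examples.
texts = ["the third reason", "no, absolutely not", "look over there"]
labels = ["count_on_fingers", "head_shake_palm_down", "point_deictic"]

# Lightweight label generator: TF-IDF features + logistic regression,
# a stand-in for whatever compact model is actually trained.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["there were three of them"]))  # single-pass inference
```

Predicted labels would then condition the downstream gesture synthesizer, which is why label accuracy and inference latency are the two metrics the abstract reports.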