Intentional Gesture: Deliver Your Intentions with Gestures for Speech

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts) and neglect the underlying communicative intent, yielding gestures that are temporally synchronized with speech yet semantically impoverished. This work reframes gesture generation as an intent-inference task and proposes an intent-driven generative framework: (1) it introduces communicative intent as the primary conditioning signal for the first time; (2) it constructs InG, the first large-scale gesture–intent paired dataset, by augmenting BEAT-2 with intent annotations produced automatically by large vision-language models; and (3) it designs an intent-aware motion tokenizer that enables a controllable mapping from high-level intent to low-level motion, with intent and motion modeled jointly end to end. Evaluated on the BEAT-2 benchmark, the method achieves new state-of-the-art performance, improving the semantic richness, temporal alignment, and expressive naturalness of generated gestures and advancing human-like nonverbal interaction for digital humans and embodied AI.
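
As a concrete illustration of the annotation step, the sketch below shows how gesture–intent labels could be produced automatically with a vision-language model: sampled frames plus the transcript are sent with a summarization prompt, and the returned sentence becomes the InG annotation. The `query_vlm` helper, the `Clip` structure, and the prompt wording are hypothetical placeholders, not the paper's actual pipeline.

```python
# Hypothetical sketch of VLM-based gesture-intent annotation.
# query_vlm() stands in for any real vision-language model API;
# the prompt wording and data layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Clip:
    frames: list      # sampled video frames of the speaker
    transcript: str   # time-aligned speech transcript

PROMPT = (
    "You are shown video frames of a speaker and the matching transcript:\n"
    "{transcript}\n"
    "In one sentence, summarize the communicative intention behind the "
    "speaker's gestures (e.g., emphasizing a point, describing a shape, "
    "enumerating items)."
)

def query_vlm(frames, prompt):
    """Placeholder for a real multimodal model call."""
    return "The speaker spreads both hands to emphasize the scale of the problem."

def annotate_intent(clip: Clip) -> str:
    # Pair the sampled frames with the prompt and ask the VLM for a
    # one-sentence intention summary, which becomes the dataset label.
    return query_vlm(clip.frames, PROMPT.format(transcript=clip.transcript))

if __name__ == "__main__":
    clip = Clip(frames=[], transcript="This issue is much bigger than we thought.")
    print(annotate_intent(clip))
```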

📝 Abstract
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations: it injects high-level communicative functions (e.g., intentions) into tokenized motion representations, enabling intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, and it achieves new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
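
To make the tokenizer idea concrete, here is a minimal PyTorch sketch of one plausible design: a VQ-VAE-style motion tokenizer whose latents attend to intention-sentence embeddings before quantization. The dimensions (e.g., the 165-d pose vector), the cross-attention placement, and the straight-through quantizer are assumptions for illustration; the paper's actual architecture may differ.

```python
# Illustrative intention-aware motion tokenizer: a VQ-VAE-style
# encoder/decoder whose latent motion features attend to intention-text
# embeddings before quantization. All sizes are assumptions.
import torch
import torch.nn as nn

class IntentAwareMotionTokenizer(nn.Module):
    def __init__(self, pose_dim=165, latent_dim=256, codebook_size=512, text_dim=768):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Inject communicative intent: motion latents (queries) attend
        # to intention-sentence embeddings (keys/values).
        self.intent_proj = nn.Linear(text_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, pose_dim),
        )

    def quantize(self, z):
        # Nearest-codeword lookup with a straight-through gradient estimator.
        flat = z.reshape(-1, z.size(-1))                   # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)     # (B*T, K)
        ids = dist.argmin(dim=-1).view(z.shape[:-1])       # (B, T)
        zq = self.codebook(ids)
        return z + (zq - z).detach(), ids

    def forward(self, motion, intent_emb):
        # motion: (B, T, pose_dim); intent_emb: (B, L, text_dim)
        z = self.encoder(motion)
        intent = self.intent_proj(intent_emb)
        z = z + self.cross_attn(z, intent, intent)[0]      # intent-conditioned latents
        zq, ids = self.quantize(z)
        return self.decoder(zq), ids

if __name__ == "__main__":
    tok = IntentAwareMotionTokenizer()
    motion = torch.randn(2, 64, 165)   # batch of pose sequences
    intent = torch.randn(2, 12, 768)   # e.g., frozen text-encoder embeddings
    recon, ids = tok(motion, intent)
    print(recon.shape, ids.shape)      # (2, 64, 165) and (2, 64)
```
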
Problem

Research questions and friction points this paper is trying to address.

Generating semantically meaningful co-speech gestures
Understanding communicative intentions behind human gestures
Improving gesture synthesis with intention-aware motion tokenization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates gestures based on communicative intentions (see the sketch after this list)
Uses intention annotations from vision-language models
Tokenizes motion with high-level communicative functions
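
As a rough sketch of the generation step, under the same illustrative assumptions as the tokenizer above: a small transformer predicts discrete motion-token ids from speech features prefixed with an intention embedding, and the tokenizer's decoder maps those ids back to poses. Module names and sizes here are hypothetical, not the paper's exact model.

```python
# Illustrative intent-conditioned gesture model: speech features are
# prefixed with a projected intention embedding, and a transformer
# predicts per-frame motion-token ids for the tokenizer's decoder.
import torch
import torch.nn as nn

class IntentConditionedGestureModel(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, d_model=256, codebook_size=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.intent_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, codebook_size)  # logits over motion tokens

    def forward(self, audio_feats, intent_emb):
        # audio_feats: (B, T, audio_dim); intent_emb: (B, text_dim)
        x = self.audio_proj(audio_feats)
        intent = self.intent_proj(intent_emb).unsqueeze(1)  # prepend as a prefix token
        x = self.backbone(torch.cat([intent, x], dim=1))
        return self.head(x[:, 1:])                          # per-frame token logits

if __name__ == "__main__":
    model = IntentConditionedGestureModel()
    logits = model(torch.randn(2, 64, 128), torch.randn(2, 768))
    token_ids = logits.argmax(dim=-1)   # feed to the tokenizer's decoder
    print(token_ids.shape)              # (2, 64)
```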