Intentional Gesture: Deliver Your Intentions with Gestures for Speech

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts) and neglect the underlying communicative intent, yielding gestures that are temporally synchronized with speech yet semantically impoverished. This work reframes gesture generation as an intent-inference task and proposes an intent-driven generative framework: (1) it introduces communicative intent as the primary conditioning signal for the first time; (2) it constructs InG, the first large-scale gesture–intent paired dataset, by augmenting BEAT-2 with intent annotations produced automatically by large vision-language models; and (3) it designs an intent-aware motion tokenizer that enables a controllable mapping from high-level intent to low-level motion, with intent and motion modeled jointly end to end. Evaluated on the BEAT-2 benchmark, the method achieves new state-of-the-art performance, improving the semantic richness, temporal alignment, and expressive naturalness of generated gestures and advancing human-like nonverbal interaction for digital humans and embodied AI.
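
As a concrete illustration of the annotation step, the sketch below shows how gesture–intent labels could be produced automatically with a vision-language model: sampled frames plus the transcript are sent with a summarization prompt, and the returned sentence becomes the InG annotation. The `query_vlm` helper, the `Clip` structure, and the prompt wording are hypothetical placeholders, not the paper's actual pipeline.

```python
# Hypothetical sketch of VLM-based gesture-intent annotation.
# query_vlm() stands in for any real vision-language model API;
# the prompt wording and data layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Clip:
    frames: list      # sampled video frames of the speaker
    transcript: str   # time-aligned speech transcript

PROMPT = (
    "You are shown video frames of a speaker and the matching transcript:\n"
    "{transcript}\n"
    "In one sentence, summarize the communicative intention behind the "
    "speaker's gestures (e.g., emphasizing a point, describing a shape, "
    "enumerating items)."
)

def query_vlm(frames, prompt):
    """Placeholder for a real multimodal model call."""
    return "The speaker spreads both hands to emphasize the scale of the problem."

def annotate_intent(clip: Clip) -> str:
    # Pair the sampled frames with the prompt and ask the VLM for a
    # one-sentence intention summary, which becomes the dataset label.
    return query_vlm(clip.frames, PROMPT.format(transcript=clip.transcript))

if __name__ == "__main__":
    clip = Clip(frames=[], transcript="This issue is much bigger than we thought.")
    print(annotate_intent(clip))
```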

📝 Abstract
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations: it injects high-level communicative functions (e.g., intentions) into tokenized motion representations, enabling intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, and it achieves new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
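
To make the tokenizer idea concrete, here is a minimal PyTorch sketch of one plausible design: a VQ-VAE-style motion tokenizer whose latents attend to intention-sentence embeddings before quantization. The dimensions (e.g., the 165-d pose vector), the cross-attention placement, and the straight-through quantizer are assumptions for illustration; the paper's actual architecture may differ.

```python
# Illustrative intention-aware motion tokenizer: a VQ-VAE-style
# encoder/decoder whose latent motion features attend to intention-text
# embeddings before quantization. All sizes are assumptions.
import torch
import torch.nn as nn

class IntentAwareMotionTokenizer(nn.Module):
    def __init__(self, pose_dim=165, latent_dim=256, codebook_size=512, text_dim=768):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Inject communicative intent: motion latents (queries) attend
        # to intention-sentence embeddings (keys/values).
        self.intent_proj = nn.Linear(text_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, pose_dim),
        )

    def quantize(self, z):
        # Nearest-codeword lookup with a straight-through gradient estimator.
        flat = z.reshape(-1, z.size(-1))                   # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)     # (B*T, K)
        ids = dist.argmin(dim=-1).view(z.shape[:-1])       # (B, T)
        zq = self.codebook(ids)
        return z + (zq - z).detach(), ids

    def forward(self, motion, intent_emb):
        # motion: (B, T, pose_dim); intent_emb: (B, L, text_dim)
        z = self.encoder(motion)
        intent = self.intent_proj(intent_emb)
        z = z + self.cross_attn(z, intent, intent)[0]      # intent-conditioned latents
        zq, ids = self.quantize(z)
        return self.decoder(zq), ids

if __name__ == "__main__":
    tok = IntentAwareMotionTokenizer()
    motion = torch.randn(2, 64, 165)   # batch of pose sequences
    intent = torch.randn(2, 12, 768)   # e.g., frozen text-encoder embeddings
    recon, ids = tok(motion, intent)
    print(recon.shape, ids.shape)      # (2, 64, 165) and (2, 64)
```
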
Problem

Research questions and friction points this paper is trying to address.

Generating semantically meaningful co-speech gestures
Understanding communicative intentions behind human gestures
Improving gesture synthesis with intention-aware motion tokenization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates gestures based on communicative intentions (see the sketch after this list)
Uses intention annotations from vision-language models
Tokenizes motion with high-level communicative functions
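
As a rough sketch of the generation step, under the same illustrative assumptions as the tokenizer above: a small transformer predicts discrete motion-token ids from speech features prefixed with an intention embedding, and the tokenizer's decoder maps those ids back to poses. Module names and sizes here are hypothetical, not the paper's exact model.

```python
# Illustrative intent-conditioned gesture model: speech features are
# prefixed with a projected intention embedding, and a transformer
# predicts per-frame motion-token ids for the tokenizer's decoder.
import torch
import torch.nn as nn

class IntentConditionedGestureModel(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, d_model=256, codebook_size=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.intent_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, codebook_size)  # logits over motion tokens

    def forward(self, audio_feats, intent_emb):
        # audio_feats: (B, T, audio_dim); intent_emb: (B, text_dim)
        x = self.audio_proj(audio_feats)
        intent = self.intent_proj(intent_emb).unsqueeze(1)  # prepend as a prefix token
        x = self.backbone(torch.cat([intent, x], dim=1))
        return self.head(x[:, 1:])                          # per-frame token logits

if __name__ == "__main__":
    model = IntentConditionedGestureModel()
    logits = model(torch.randn(2, 64, 128), torch.randn(2, 768))
    token_ids = logits.argmax(dim=-1)   # feed to the tokenizer's decoder
    print(token_ids.shape)              # (2, 64)
```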