Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing approaches that jointly model speech and gesture: they align transcripts with low-level motion features and often fail to capture the communicative intent of semantic gestures. The authors propose semantic motion anchors, which discretize 3D gestures into body-hand motion primitives, verbalize them as structured natural-language descriptions, and ground those descriptions in the spoken transcript to provide auxiliary contrastive supervision. The anchors are natural-language abstractions that encode both physical form and communicative intent, in contrast to direct contrastive alignment of transcripts with continuous motion embeddings. On BEAT2, the method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by this approach over a retrieval-augmented generation baseline.
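To make the discretize-then-verbalize idea concrete, here is a toy sketch. It is not the authors' pipeline: the primitive vocabulary (`raise`, `lower`, `hold`), the threshold `eps`, the single wrist-height trajectory, and the description template are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    part: str    # hypothetical label, e.g. "body" or "hand"
    action: str  # hypothetical vocabulary: "raise", "lower", "hold"


def discretize(heights, part, eps=0.05):
    """Map a 1-D trajectory (e.g. wrist height per frame) to primitives.

    Each frame-to-frame change becomes raise/lower/hold, and consecutive
    repeats of the same action are merged into one primitive.
    """
    prims = []
    for prev, cur in zip(heights, heights[1:]):
        delta = cur - prev
        if delta > eps:
            action = "raise"
        elif delta < -eps:
            action = "lower"
        else:
            action = "hold"
        if not prims or prims[-1].action != action:
            prims.append(Primitive(part, action))
    return prims


def verbalize(prims, transcript):
    """Render primitives as a structured description paired with the transcript."""
    steps = ", then ".join(f"{p.action} the {p.part}" for p in prims)
    return f'Motion: {steps}. Spoken text: "{transcript}".'


if __name__ == "__main__":
    prims = discretize([0.0, 0.2, 0.4, 0.4, 0.1], part="hand")
    print(verbalize(prims, "this big"))
```

In this sketch the resulting string plays the role of an anchor: a text description that a text encoder could embed alongside the transcript. How the paper actually defines primitives and encodes communicative intent is not specified in this summary.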

📝 Abstract

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion that capture both physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.
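The abstract's two technical ingredients, auxiliary contrastive supervision and the R@1 retrieval metric, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the symmetric InfoNCE form, the temperature of 0.07, the weight `aux_weight`, and the choice to pair motion embeddings with anchor-description embeddings in the auxiliary term are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def cross_entropy_diag(sims, temperature):
    """Mean of -log softmax(row i)[i]; item i's positive is column i."""
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sims)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: average of the a-to-b and b-to-a directions."""
    sims = similarity_matrix(a, b)
    sims_t = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy_diag(sims, temperature)
                  + cross_entropy_diag(sims_t, temperature))


def total_loss(text, motion, anchor, aux_weight=0.5):
    """Direct text-motion term plus an assumed auxiliary motion-anchor term."""
    return info_nce(text, motion) + aux_weight * info_nce(motion, anchor)


def recall_at_1(queries, gallery):
    """Fraction of queries whose most similar gallery item is the true match."""
    sims = similarity_matrix(queries, gallery)
    hits = sum(
        1 for i, row in enumerate(sims)
        if max(range(len(row)), key=row.__getitem__) == i
    )
    return hits / len(sims)
```

Here `text`, `motion`, and `anchor` are lists of equal-length embedding vectors in which row i of each list describes the same clip. Calling `recall_at_1(text, motion)` gives text-to-gesture R@1, and swapping the arguments gives the gesture-to-text direction.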

Problem

Research questions and friction points this paper is trying to address.

co-speech gestures

semantic representation

gesture retrieval

communicative intent

text-motion alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic motion anchors

co-speech gestures

motion-text alignment