Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

📅 2025-07-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing co-speech gesture generation methods rely on predefined categorical labels or implicit pseudo-labels derived from motion examples, which discards much of the detail present in the original motion. To address this, we propose MECo, a framework that places raw motion examples directly in the prompt as query context for large language models (LLMs), enabling joint understanding of speech and motion. MECo combines speech encoding, motion feature extraction, and careful prompt design to support fine-grained control over individual body parts and to accept diverse inputs, including motion clips, static poses, human videos, and text. Evaluated on Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity, MECo achieves state-of-the-art performance, improving both the semantic consistency between generated gestures and input speech and the fidelity to the motion example, preserving nuanced kinematic details that matter for naturalistic human animation.
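For reference, FGD is a distribution-level metric in the style of FID: Gaussians are fitted to feature embeddings of real and generated gestures and compared with the Fréchet distance. The sketch below is a generic, assumed implementation of that computation (the gesture feature extractor and evaluation protocol used by the paper are not specified here).

```python
# Hedged sketch of Fréchet Gesture Distance (FGD), assuming gesture clips have
# already been embedded by some feature extractor (not specified by this page).
import numpy as np
from scipy.linalg import sqrtm

def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: (N, D) arrays of gesture feature embeddings."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```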

📝 Abstract
The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/MECo-Page.
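To make the "motion examples as explicit query contexts" idea concrete, here is a minimal, hypothetical sketch of how such a prompt might be assembled. The motion tokenizer, token vocabulary, tag names, and body-part field are all assumptions for illustration, not MECo's actual interface.

```python
# Minimal sketch: placing a raw motion example in the LLM prompt as query
# context, instead of reducing it to a categorical pseudo-label.
# All identifiers below (token formats, tags, body_part values) are assumed.
from dataclasses import dataclass
from typing import List

@dataclass
class MotionExample:
    """A short motion clip, pre-quantized into discrete motion tokens."""
    tokens: List[int]          # e.g. indices from a motion tokenizer (assumed)
    body_part: str = "full"    # e.g. "full", "upper", "hands" (assumed)

def build_prompt(speech_tokens: List[int], examples: List[MotionExample]) -> str:
    """Interleave speech tokens and motion-example tokens as explicit context;
    a fine-tuned LLM would continue the sequence with gesture tokens."""
    parts = ["<speech>" + " ".join(f"<a{t}>" for t in speech_tokens) + "</speech>"]
    for ex in examples:
        motion_str = " ".join(f"<m{t}>" for t in ex.tokens)
        parts.append(f"<example part={ex.body_part}>{motion_str}</example>")
    parts.append("<gesture>")  # generation target starts here
    return "\n".join(parts)

if __name__ == "__main__":
    speech = [12, 87, 3, 55]                            # placeholder audio tokens
    ref = MotionExample(tokens=[101, 42, 7], body_part="upper")
    print(build_prompt(speech, [ref]))
```

Keeping the example as tokens in the query context, rather than collapsing it to a label, is what lets the model condition on example-specific details and restrict control to individual body parts.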
Problem

Research questions and friction points this paper is trying to address.

Generating gestures controlled by motion examples and speech
Preserving motion details while ensuring speech-gesture alignment
Enabling multi-modal inputs for granular gesture control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs for speech and motion interpretation
Uses motion examples as explicit query contexts
Enables granular control of individual body parts
Bohong Chen
Zhejiang University
Yumeng Li
State Key Lab of CAD&CG, Zhejiang University, China
Youyi Zheng
State Key Lab of CAD&CG, Zhejiang University, China
Yao-Xiang Ding
Assistant Professor, Zhejiang University
Kun Zhou
State Key Lab of CAD&CG, Zhejiang University, China