MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current large language models (LLMs) for motion exhibit limited generalization, struggling to unify single- and multi-agent motion modeling as well as cross-modal tasks involving motion-text, motion-music, or motion-speech alignment. To address this, we propose the first unified generative and understanding framework for full-body motion. Our approach comprises three core components: (1) a high-fidelity, single-codebook motion tokenizer—HoMi Tokenizer; (2) MotionHub, a large-scale, multitask, multimodal motion dataset; and (3) an LLM-based architecture enabling semantic-motion cross-modal alignment, trained via a joint reconstruction and conditional generation paradigm. Our method achieves state-of-the-art performance on motion completion, text-driven two-person interactive motion generation, and all evaluated motion understanding benchmarks—covering the broadest task spectrum to date. Both the codebase and the MotionHub dataset are publicly released.

📝 Abstract
This paper introduces MotionLLaMA, a unified framework for motion synthesis and comprehension, along with a novel full-body motion tokenizer, the HoMi Tokenizer. MotionLLaMA rests on three core components. First, it establishes a powerful unified representation space through the HoMi Tokenizer: using a single codebook, it achieves reconstruction accuracy comparable to residual vector quantization tokenizers that use six codebooks, outperforming all existing single-codebook tokenizers. Second, MotionLLaMA integrates a large language model to tackle a wide range of motion-related tasks, bridging modalities to enable both comprehensive and intricate motion synthesis and comprehension. Third, MotionLLaMA introduces the MotionHub dataset, currently the most extensive multimodal, multitask motion dataset, which enables fine-tuning of large language models. Extensive experiments demonstrate that MotionLLaMA not only covers the widest range of motion-related tasks to date but also achieves state-of-the-art (SOTA) performance on motion completion, two-person interactive text-to-motion generation, and all comprehension tasks, while remaining comparable to SOTA on the rest. The code and MotionHub dataset are publicly available.
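The abstract's key tokenizer claim is that a single codebook suffices to discretize motion. As a minimal illustration of single-codebook vector quantization in general (not the actual HoMi Tokenizer, whose architecture this page does not detail; all shapes and sizes below are assumptions), each frame embedding is snapped to its nearest code:

```python
import numpy as np

def quantize(frames, codebook):
    """Map each motion-frame embedding to its nearest codebook entry.

    frames:   (T, D) array of per-frame latent vectors
    codebook: (K, D) array of K learned code vectors
    Returns the (T,) discrete token ids and the (T, D) quantized vectors.
    """
    # Squared Euclidean distance from every frame to every code.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)       # one discrete motion token per frame
    return ids, codebook[ids]     # tokens + quantized reconstruction input

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # K=512 codes, D=64 dims (illustrative)
frames = rng.normal(size=(30, 64))     # 30 motion frames
ids, quantized = quantize(frames, codebook)
```

Once motion is reduced to discrete ids like these, an LLM can treat them like ordinary text tokens, which is what makes the unified generation-and-comprehension setup described above possible.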
Problem

Research questions and friction points this paper is trying to address.

Unified framework for motion synthesis and comprehension tasks
Enabling cross-modal conversion between motion, text, music, speech
Handling single-agent and multi-agent motions in one model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel motion tokenizer with VQ-VAE and flow matching
Autoregressive transformer for multi-task motion processing
Unified framework for cross-modal motion conversion
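For context on the innovation claims, the abstract benchmarks the single-codebook tokenizer against residual vector quantization (RVQ) with six codebooks. A hedged sketch of the generic RVQ idea, in which each stage quantizes the residual error left by the previous stages (codebook count, sizes, and dimensions are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    ids, recon = [], np.zeros_like(frames)
    residual = frames.copy()
    for cb in codebooks:  # e.g. six codebooks, coarse to fine
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        i = d2.argmin(axis=1)
        ids.append(i)
        recon += cb[i]                 # accumulate the approximation
        residual = frames - recon      # remaining error for the next stage
    return ids, recon

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(256, 64)) for _ in range(6)]
frames = rng.normal(size=(30, 64))
ids, recon = rvq_encode(frames, codebooks)
```

RVQ buys reconstruction fidelity at the cost of emitting six token streams per frame; the paper's claim is that one well-designed codebook can match that fidelity, keeping the LLM's motion vocabulary flat and simple.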
👥 Authors
Zeyu Ling, Zhejiang University (Computer Vision)
Bo Han, Zhejiang University & Zhejiang Lab
Shiyang Li, Amazon (Machine Learning · Natural Language Processing · Time Series Modeling)
Hongdeng Shen, University of Chinese Academy of Sciences & Zhejiang Lab
Jikang Cheng, University of Chinese Academy of Sciences & Zhejiang Lab
Changqing Zou, Zhejiang University & Zhejiang Lab