MotionGPT3: Human Motion as a Second Modality

📅 2025-06-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current unified motion-language modeling faces two key challenges: (1) mismatch between continuous motion reconstruction and discrete representation, and (2) degradation of language capabilities due to multimodal joint training. To address these, we propose the first autoregressive framework for a unified motion–language bimodal diffusion model. Our approach employs a motion variational autoencoder (VAE) to encode raw human motion into continuous latent representations, and directly models these latents via a diffusion head—bypassing quantization entirely. We further introduce a dedicated motion branch alongside shared cross-modal attention, enabling bidirectional information exchange while preserving the architecture and linguistic competence of pretrained language models. Experiments demonstrate state-of-the-art performance on both motion understanding and generation tasks, with no compromise to language modeling capability. This work is the first to empirically validate the feasibility of efficiently co-modeling continuous motion and natural language within a unified diffusion paradigm.

📝 Abstract
Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and its discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient, scalable multimodal training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between the two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework in an autoregressive manner.
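The shared-attention design described above can be sketched as follows: each modality keeps its own projection weights (the "separate model parameters"), while attention runs over the concatenated text and motion token sequence, so information flows in both directions. This is a minimal single-head NumPy sketch under assumed shapes and names, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(text_tokens, motion_tokens, W_text, W_motion):
    """One bimodal attention step: per-modality Q/K/V projections
    (separate parameters), a single attention over the joint sequence
    (shared attention), then routing outputs back to each branch."""
    d = text_tokens.shape[-1]
    # Project each modality with its own Q, K, V weights.
    q_t, k_t, v_t = (text_tokens @ W for W in W_text)
    q_m, k_m, v_m = (motion_tokens @ W for W in W_motion)
    # Concatenate along the sequence axis: one shared attention map
    # lets every text token attend to every motion token and vice versa.
    q = np.concatenate([q_t, q_m], axis=0)
    k = np.concatenate([k_t, k_m], axis=0)
    v = np.concatenate([v_t, v_m], axis=0)
    attn = softmax(q @ k.T / np.sqrt(d))
    out = attn @ v
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]  # back to text / motion branches

rng = np.random.default_rng(0)
d = 8
W_text = [rng.normal(size=(d, d)) for _ in range(3)]
W_motion = [rng.normal(size=(d, d)) for _ in range(3)]
text_out, motion_out = shared_attention(
    rng.normal(size=(5, d)), rng.normal(size=(3, d)), W_text, W_motion)
print(text_out.shape, motion_out.shape)  # (5, 8) (3, 8)
```

Keeping `W_text` and `W_motion` disjoint is what lets the text branch remain frozen at the pretrained language model's weights while the motion branch is trained.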
Problem

Research questions and friction points this paper is trying to address.

Bridging continuous motion and discrete language representation gap
Preventing language intelligence degradation in unified training
Enabling cross-modal interaction for motion-language understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bimodal motion-language model with separate parameters
Shared attention mechanism for cross-modal interaction
Motion VAE and diffusion head bypass discrete tokenization
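The last point, direct latent prediction via a diffusion head, can be illustrated with a toy denoising loop: starting from Gaussian noise, a small head conditioned on the autoregressive hidden state iteratively refines a continuous motion latent, so no codebook or discrete token lookup is involved. All shapes, the step schedule, and the fixed linear "denoiser" below are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, hidden_dim, steps = 16, 32, 10

# Toy "denoiser": in the paper this is a learned diffusion head; here
# a fixed random linear map of (noisy latent, hidden state, timestep).
W = rng.normal(scale=0.1, size=(latent_dim + hidden_dim + 1, latent_dim))

def denoise_step(z, h, t):
    inp = np.concatenate([z, h, [t / steps]])
    eps_hat = inp @ W              # predicted noise
    return z - eps_hat / steps     # simple Euler-style update

def sample_motion_latent(h):
    """Sample a continuous motion latent conditioned on hidden state h,
    bypassing any discrete tokenization; the result would be decoded
    into a motion sequence by the VAE decoder."""
    z = rng.normal(size=latent_dim)  # start from pure noise
    for t in reversed(range(steps)):
        z = denoise_step(z, h, t)
    return z

h = rng.normal(size=hidden_dim)      # intermediate transformer hidden state
z = sample_motion_latent(h)
print(z.shape)  # (16,)
```

Because the head outputs a point in the VAE's continuous latent space rather than an index into a codebook, the reconstruction gap introduced by quantization never arises.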
👥 Authors
Bingfan Zhu — Zhejiang University
Biao Jiang — Peking University (Computer Vision)
Sunyi Wang — Zhejiang University
Shixiang Tang — The Chinese University of Hong Kong
Tao Chen — Fudan University
Linjie Luo — Research Manager at ByteDance AI Lab (Computer Graphics, Computer Vision)
Youyi Zheng — Zhejiang University
Xin Chen — ByteDance