DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DiMo, a unified discrete diffusion-based model that overcomes the limitations of existing text-motion generation approaches, which are typically confined to unidirectional tasks. DiMo enables bidirectional understanding and generation—supporting text-to-motion, motion-to-text, and unconditional motion synthesis—within a single architecture through iterative masked token refinement. It leverages residual vector quantization (RVQ) to enhance motion token fidelity and integrates group relative policy optimization (GRPO) to improve semantic alignment and controllability. The model further allows for quality-latency trade-offs during inference and performs motion completion, prediction, and caption correction without architectural modifications. Evaluated on HumanML3D and KIT-ML benchmarks, DiMo demonstrates state-of-the-art performance in both high-quality motion generation and bidirectional cross-modal understanding.

📝 Abstract
Prior masked-modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework that extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate the model's ability to perform text-free motion completion, text-guided motion prediction, and motion caption correction without architectural changes. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.
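The iterative masked token refinement the abstract describes can be sketched as confidence-based parallel decoding (in the spirit of MaskGIT-style discrete diffusion): all tokens start masked, each pass predicts every position, and only the most confident predictions are committed, with the step count acting as the quality-latency knob. Everything below (the toy denoiser, vocabulary size, and cosine unmasking schedule) is an illustrative assumption, not DiMo's actual implementation.

```python
import math
import random

MASK = -1   # sentinel id for a masked motion token (assumed)
VOCAB = 8   # toy codebook size (assumed)

def toy_model(tokens):
    """Stand-in for the denoiser: returns (token, confidence) per position."""
    rng = random.Random(0)
    return [(rng.randrange(VOCAB), rng.random()) for _ in tokens]

def iterative_refine(seq_len, steps):
    """Decode seq_len tokens in `steps` refinement passes.

    More steps -> fewer tokens committed per pass -> higher quality,
    higher latency: the quality-latency trade-off at inference.
    """
    tokens = [MASK] * seq_len
    for step in range(steps):
        preds = toy_model(tokens)
        # Cosine schedule: how many positions may stay masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        keep_masked = int(seq_len * frac)
        # Rank still-masked positions by confidence; commit the most confident.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens

out = iterative_refine(seq_len=16, steps=4)
assert all(t != MASK for t in out)  # fully decoded after the final step
```

Running with `steps=1` collapses to single-shot parallel decoding; larger `steps` commit fewer tokens per pass, which is where the extra quality comes from.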
Problem

Research questions and friction points this paper is trying to address.

motion generation
text-motion alignment
bidirectional understanding
masked modeling
motion representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Diffusion
Masked Modeling
Bidirectional Text-Motion
Residual Vector Quantization
Group Relative Policy Optimization
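The residual vector quantization listed above can be sketched as a cascade of codebooks in which each stage quantizes the residual left by the previous one, so reconstruction error shrinks with depth. The random codebooks, dimensions, and stage count below are toy assumptions for illustration, not DiMo's learned motion tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODES, STAGES = 4, 32, 3                      # toy sizes (assumed)
codebooks = rng.normal(size=(STAGES, CODES, DIM))  # one codebook per stage

def rvq_encode(x):
    """Return one code index per stage, quantizing the running residual."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        # Nearest codeword to the current residual (squared Euclidean).
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices):
    """Reconstruction is the sum of the selected codewords across stages."""
    return sum(codebooks[s][i] for s, i in enumerate(indices))

x = rng.normal(size=DIM)
codes = rvq_encode(x)          # e.g. [idx_stage0, idx_stage1, idx_stage2]
recon = rvq_decode(codes)
```

A single-stage quantizer keeps only the first index; the extra stages are what recover the fidelity lost to a coarse first codebook, which is the motivation for using RVQ on motion tokens.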
👥 Authors
Ning Zhang
Huawei Central Media Technology Institute
Zhengyu Li
Peking University
K. Loh
Huawei Central Media Technology Institute
Mingxi Xu
Huawei Central Media Technology Institute
Qi Wang
Beijing Institute of Technology
Zhengyu Wen
Huawei Central Media Technology Institute
Xiaoyu He
Huawei Central Media Technology Institute
Wei Zhao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Kehong Gong
National University of Singapore
Mingyuan Zhang
Huawei Central Media Technology Institute