DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DiMo, a unified discrete diffusion-based model that overcomes the limitations of existing text-motion generation approaches, which are typically confined to unidirectional tasks. DiMo enables bidirectional understanding and generation—supporting text-to-motion, motion-to-text, and unconditional motion synthesis—within a single architecture through iterative masked token refinement. It leverages residual vector quantization (RVQ) to enhance motion token fidelity and integrates group relative policy optimization (GRPO) to improve semantic alignment and controllability. The model further allows for quality-latency trade-offs during inference and performs motion completion, prediction, and caption correction without architectural modifications. Evaluated on HumanML3D and KIT-ML benchmarks, DiMo demonstrates state-of-the-art performance in both high-quality motion generation and bidirectional cross-modal understanding.

📝 Abstract
Prior masked-modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework that extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate the model's ability to perform text-free motion completion, text-guided motion prediction, and motion caption correction without architectural changes. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.
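The iterative masked token refinement the abstract describes can be sketched as confidence-based parallel decoding (in the spirit of MaskGIT-style discrete diffusion): all tokens start masked, each pass predicts every position, and only the most confident predictions are committed, with the step count acting as the quality-latency knob. Everything below (the toy denoiser, vocabulary size, and cosine unmasking schedule) is an illustrative assumption, not DiMo's actual implementation.

```python
import math
import random

MASK = -1   # sentinel id for a masked motion token (assumed)
VOCAB = 8   # toy codebook size (assumed)

def toy_model(tokens):
    """Stand-in for the denoiser: returns (token, confidence) per position."""
    rng = random.Random(0)
    return [(rng.randrange(VOCAB), rng.random()) for _ in tokens]

def iterative_refine(seq_len, steps):
    """Decode seq_len tokens in `steps` refinement passes.

    More steps -> fewer tokens committed per pass -> higher quality,
    higher latency: the quality-latency trade-off at inference.
    """
    tokens = [MASK] * seq_len
    for step in range(steps):
        preds = toy_model(tokens)
        # Cosine schedule: how many positions may stay masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        keep_masked = int(seq_len * frac)
        # Rank still-masked positions by confidence; commit the most confident.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens

out = iterative_refine(seq_len=16, steps=4)
assert all(t != MASK for t in out)  # fully decoded after the final step
```

Running with `steps=1` collapses to single-shot parallel decoding; larger `steps` commit fewer tokens per pass, which is where the extra quality comes from.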
Problem

Research questions and friction points this paper is trying to address.

motion generation
text-motion alignment
bidirectional understanding
masked modeling
motion representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Diffusion
Masked Modeling
Bidirectional Text-Motion
Residual Vector Quantization
Group Relative Policy Optimization
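The residual vector quantization listed above can be sketched as a cascade of codebooks in which each stage quantizes the residual left by the previous one, so reconstruction error shrinks with depth. The random codebooks, dimensions, and stage count below are toy assumptions for illustration, not DiMo's learned motion tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODES, STAGES = 4, 32, 3                      # toy sizes (assumed)
codebooks = rng.normal(size=(STAGES, CODES, DIM))  # one codebook per stage

def rvq_encode(x):
    """Return one code index per stage, quantizing the running residual."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        # Nearest codeword to the current residual (squared Euclidean).
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices):
    """Reconstruction is the sum of the selected codewords across stages."""
    return sum(codebooks[s][i] for s, i in enumerate(indices))

x = rng.normal(size=DIM)
codes = rvq_encode(x)          # e.g. [idx_stage0, idx_stage1, idx_stage2]
recon = rvq_decode(codes)
```

A single-stage quantizer keeps only the first index; the extra stages are what recover the fidelity lost to a coarse first codebook, which is the motivation for using RVQ on motion tokens.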
👥 Authors
Ning Zhang
Huawei Central Media Technology Institute
Zhengyu Li
Peking University
K. Loh
Huawei Central Media Technology Institute
Mingxi Xu
Huawei Central Media Technology Institute
Qi Wang
Beijing Institute of Technology
Zhengyu Wen
Huawei Central Media Technology Institute
Xiaoyu He
Huawei Central Media Technology Institute
Wei Zhao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Kehong Gong
National University of Singapore
Mingyuan Zhang
Huawei Central Media Technology Institute