MoSa: Motion Generation with Scalable Autoregressive Modeling

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low multi-scale modeling efficiency and poor reconstruction quality in text-driven 3D human motion generation, this paper proposes MoSa, a hierarchical residual vector-quantized autoregressive generative framework. Its core innovation is a multi-scale token preservation strategy that enables efficient, scalable modeling from coarse to fine granularity. It also introduces CAQ-VAE, a lightweight convolution-attention hybrid VQ-VAE that mitigates the reconstruction degradation caused by conventional interpolation-based upsampling. MoSa achieves high-fidelity motion generation in only ten autoregressive decoding steps. On the Motion-X benchmark, it attains an FID of 0.06, significantly outperforming MoMask (0.20), while accelerating inference by 27%. Crucially, MoSa supports downstream tasks such as motion editing without task-specific fine-tuning.

📝 Abstract
We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask's 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web
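The abstract's Multi-scale Token Preservation Strategy can be illustrated with a minimal sketch: at each quantization level, the running residual is interpolated down to that scale's length, quantized against a codebook, then interpolated back to full length and subtracted, so each scale yields a whole set of tokens for the transformer to predict in one step. This is a hedged illustration, not the paper's implementation; the function names, the shared codebook, and the scale schedule `(4, 8, 16, 32, 64)` are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def resize(z, length):
    """Linearly interpolate a (T, d) latent sequence along time to `length` steps."""
    T, d = z.shape
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, length)
    return np.stack([np.interp(dst, src, z[:, j]) for j in range(d)], axis=1)

def quantize(z, codebook):
    """Nearest-neighbour vector quantization: return token ids and their embeddings."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(1)
    return ids, codebook[ids]

def multiscale_residual_quantize(z, codebook, scales):
    """Coarse-to-fine residual quantization: each scale quantizes what the
    coarser scales failed to explain, via down-/up-interpolation."""
    T = z.shape[0]
    residual = z.copy()
    tokens, recon = [], np.zeros_like(z)
    for T_s in scales:                # coarse-to-fine scale lengths (assumed schedule)
        ids, q = quantize(resize(residual, T_s), codebook)
        q_full = resize(q, T)         # upsample quantized tokens back to full length
        recon += q_full
        residual -= q_full
        tokens.append(ids)            # one token *set* per scale, one AR step each
    return tokens, recon

T, d, K = 64, 8, 256
z = rng.normal(size=(T, d))           # toy motion latent sequence
codebook = rng.normal(size=(K, d))    # toy shared codebook (assumption)
tokens, recon = multiscale_residual_quantize(z, codebook, scales=(4, 8, 16, 32, 64))
```

Under this reading, the generator needs only as many decoding steps as there are quantization scales (ten in MoSa), since each step emits an entire scale's token set rather than a single token.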
Problem

Research questions and friction points this paper is trying to address.

Generating 3D human motion from text descriptions efficiently
Reducing inference steps in motion generation using scalable autoregressive modeling
Improving motion reconstruction quality with a hybrid VQ-VAE architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical motion generation with scalable autoregressive modeling
Multi-scale token preservation strategy in RQ-VAE
Lightweight convolution-attention hybrid VQ-VAE design
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Sheng Yan
Chongqing University of Technology
cross-modal retrieval · temporal grounding · motion generation
Yong Wang
Chongqing University of Technology
Yingjie Li
Tencent Technology Co., Ltd.
Gui-Bin Bian
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School