🤖 AI Summary
Standard autoregressive models struggle to capture the long-range temporal structure inherent in text-to-motion generation. To address this limitation, this work proposes MoScale, a coarse-to-fine multi-scale autoregressive framework that hierarchically models global semantics and progressively refines motion details. The approach integrates cross-scale prediction correction with intra-scale local bidirectional re-prediction, enhancing robustness under limited training data and enabling zero-shot generalization to diverse motion generation and editing tasks. Experimental results demonstrate that MoScale achieves state-of-the-art performance in text-to-motion synthesis, while maintaining high training efficiency, strong scalability, and excellent generalization capabilities.
📝 Abstract
Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited to long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement, which improves each scale's initial predictions, and in-scale temporal refinement, which performs selective bidirectional re-prediction. MoScale achieves state-of-the-art text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
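The coarse-to-fine generation loop described above can be sketched schematically. This is an illustrative sketch only, not the paper's actual implementation: the function names (`upsample`, `predict_fn`, `generate_multiscale`), the scale schedule, and the additive residual refinement are all assumptions chosen to show the next-scale idea, with each finer scale conditioned on an upsampled version of the coarser estimate.

```python
import numpy as np

def upsample(seq, new_len):
    # Nearest-neighbor temporal upsampling of a (T, D) feature sequence.
    idx = np.floor(np.linspace(0, len(seq) - 1e-9, new_len)).astype(int)
    return seq[idx]

def generate_multiscale(predict_fn, scales, dim):
    """Hypothetical coarse-to-fine next-scale generation loop.

    predict_fn(context, target_len) -> (target_len, dim) array; stands in
    for the (text-conditioned) model's per-scale prediction.
    scales: increasing temporal lengths, e.g. [4, 8, 16, 32].
    """
    # Coarsest scale: predict global structure from an empty context.
    motion = np.zeros((scales[0], dim))
    motion = motion + predict_fn(motion, scales[0])
    for length in scales[1:]:
        context = upsample(motion, length)      # carry the coarse estimate up
        residual = predict_fn(context, length)  # predict finer-scale detail
        motion = context + residual             # refine the estimate in place
    return motion
```

The per-scale correction and bidirectional re-prediction steps the abstract mentions would slot into this loop as extra passes over `residual` before it is added; they are omitted here for brevity.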