Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

📅 2025-12-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing pose-encoding-based text-to-3D-motion generation methods (e.g., CoMo) offer interpretability but suffer from limited reconstruction fidelity and local editability due to frame-wise modeling, which fails to capture temporal dynamics and high-frequency motion details. To address this, we propose a hybrid representation framework: (1) a pose-guided residual vector quantization (RVQ) tokenizer with a novel residual dropout mechanism to ensure semantic alignment and editing robustness; and (2) a two-stage Transformer architecture that jointly predicts pose tokens and residual tokens, enabling simultaneous global structural control and fine-grained temporal detail modeling. Evaluated on HumanML3D and KIT-ML, our method achieves significantly lower FID scores and superior reconstruction metrics compared to CoMo and state-of-the-art diffusion- and tokenization-based baselines. User studies confirm its intuitive editing capability and strong structural preservation.

๐Ÿ“ Abstract
Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
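The abstract's tokenizer idea, residual vector quantization where stage 0 carries interpretable pose codes and later stages quantize what earlier stages missed, plus residual dropout to keep the pose codes self-sufficient, can be sketched in a few lines. This is a minimal numpy illustration of the general RVQ-with-dropout mechanism, not the paper's implementation; the function names, codebook sizes, and dropout scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(x, codebook):
    # Index of the closest codebook entry for each frame latent.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def rvq_encode(latents, codebooks, p_drop=0.2, train=False):
    """Stage 0 quantizes the latent with the (interpretable) pose
    codebook; each later stage quantizes the residual left by the
    stages before it. With probability p_drop during training,
    refinement is truncated to a random depth so the pose codes
    alone must remain usable (a residual-dropout-style mechanism;
    the exact scheme here is a guess)."""
    n_stages = len(codebooks)
    if train and rng.random() < p_drop:
        n_stages = int(rng.integers(1, len(codebooks) + 1))
    codes, recon = [], np.zeros_like(latents)
    residual = latents
    for cb in codebooks[:n_stages]:
        idx = nearest_code(residual, cb)
        codes.append(idx)
        recon = recon + cb[idx]       # accumulate coarse-to-fine
        residual = latents - recon    # what later stages must explain
    return codes, recon
```

At inference (`train=False`) all stages are used, so the reconstruction is the sum of the pose-code embedding and every residual-code embedding.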
Problem

Research questions and friction points this paper is trying to address.

Enhances interpretable text-to-motion generation with residual refinement
Improves capture of temporal dynamics in pose-code-based motion frameworks
Enables intuitive, structure-preserving motion editing from text descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid representation combining pose and residual codes
Residual vector quantization for fine-grained temporal details
Two-stage Transformer for text-conditioned pose and residual prediction
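The two-stage prediction pipeline listed above, a base Transformer that autoregressively emits pose tokens from text, then a refine Transformer that fills in residual tokens per quantization stage, has roughly the following control flow. The two model functions below are random-logit stand-ins (all names, vocab sizes, and shapes are hypothetical), so this shows only the inference loop structure, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
POSE_VOCAB, RES_VOCAB, N_RES_STAGES, SEQ_LEN = 512, 512, 2, 6

def base_transformer(text_emb, pose_prefix):
    # Stand-in: would return next-pose-token logits given the
    # text embedding and the pose tokens generated so far.
    return rng.normal(size=POSE_VOCAB)

def refine_transformer(text_emb, pose_tokens, stage):
    # Stand-in: would return residual-token logits per frame,
    # conditioned on text, pose tokens, and the stage index.
    return rng.normal(size=(len(pose_tokens), RES_VOCAB))

def generate(text_emb):
    # Stage 1: autoregressive pose-token prediction (coarse structure).
    pose = []
    for _ in range(SEQ_LEN):
        logits = base_transformer(text_emb, pose)
        pose.append(int(logits.argmax()))
    # Stage 2: residual tokens, one quantization stage at a time.
    residuals = []
    for stage in range(N_RES_STAGES):
        logits = refine_transformer(text_emb, pose, stage)
        residuals.append(logits.argmax(axis=1))
    return pose, residuals
```

Splitting generation this way is what keeps editing interpretable: the pose tokens can be modified directly, and the refine pass re-predicts only the fine-grained residuals around them.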
Sukhyun Jeong
Fintech and AI Robotics (FAIR) Laboratory, the School of Robotics, Kwangwoon University, Nowon-gu, Seoul 01897, South Korea
Yong-Hoon Choi
Kwangwoon University
Machine Learning · Communications Networks