Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

📅 2025-12-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing pose-encoding-based text-to-3D-motion generation methods (e.g., CoMo) offer interpretability but suffer from limited reconstruction fidelity and local editability due to frame-wise modeling, which fails to capture temporal dynamics and high-frequency motion details. To address this, we propose a hybrid representation framework: (1) a pose-guided residual vector quantization (RVQ) tokenizer with a novel residual dropout mechanism to ensure semantic alignment and editing robustness; and (2) a two-stage Transformer architecture that jointly predicts pose tokens and residual tokens, enabling simultaneous global structural control and fine-grained temporal detail modeling. Evaluated on HumanML3D and KIT-ML, our method achieves significantly lower FID scores and superior reconstruction metrics compared to CoMo and state-of-the-art diffusion- and tokenization-based baselines. User studies confirm its intuitive editing capability and strong structural preservation.

๐Ÿ“ Abstract
Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
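The abstract's tokenizer idea, residual vector quantization where stage 0 carries interpretable pose codes and later stages quantize what earlier stages missed, plus residual dropout to keep the pose codes self-sufficient, can be sketched in a few lines. This is a minimal numpy illustration of the general RVQ-with-dropout mechanism, not the paper's implementation; the function names, codebook sizes, and dropout scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(x, codebook):
    # Index of the closest codebook entry for each frame latent.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def rvq_encode(latents, codebooks, p_drop=0.2, train=False):
    """Stage 0 quantizes the latent with the (interpretable) pose
    codebook; each later stage quantizes the residual left by the
    stages before it. With probability p_drop during training,
    refinement is truncated to a random depth so the pose codes
    alone must remain usable (a residual-dropout-style mechanism;
    the exact scheme here is a guess)."""
    n_stages = len(codebooks)
    if train and rng.random() < p_drop:
        n_stages = int(rng.integers(1, len(codebooks) + 1))
    codes, recon = [], np.zeros_like(latents)
    residual = latents
    for cb in codebooks[:n_stages]:
        idx = nearest_code(residual, cb)
        codes.append(idx)
        recon = recon + cb[idx]       # accumulate coarse-to-fine
        residual = latents - recon    # what later stages must explain
    return codes, recon
```

At inference (`train=False`) all stages are used, so the reconstruction is the sum of the pose-code embedding and every residual-code embedding.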
Problem

Research questions and friction points this paper is trying to address.

Enhances interpretable text-to-motion generation with residual refinement
Improves capture of temporal dynamics in pose-code-based motion frameworks
Enables intuitive, structure-preserving motion editing from text descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid representation combining pose and residual codes
Residual vector quantization for fine-grained temporal details
Two-stage Transformer for text-conditioned pose and residual prediction
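The two-stage prediction pipeline listed above, a base Transformer that autoregressively emits pose tokens from text, then a refine Transformer that fills in residual tokens per quantization stage, has roughly the following control flow. The two model functions below are random-logit stand-ins (all names, vocab sizes, and shapes are hypothetical), so this shows only the inference loop structure, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
POSE_VOCAB, RES_VOCAB, N_RES_STAGES, SEQ_LEN = 512, 512, 2, 6

def base_transformer(text_emb, pose_prefix):
    # Stand-in: would return next-pose-token logits given the
    # text embedding and the pose tokens generated so far.
    return rng.normal(size=POSE_VOCAB)

def refine_transformer(text_emb, pose_tokens, stage):
    # Stand-in: would return residual-token logits per frame,
    # conditioned on text, pose tokens, and the stage index.
    return rng.normal(size=(len(pose_tokens), RES_VOCAB))

def generate(text_emb):
    # Stage 1: autoregressive pose-token prediction (coarse structure).
    pose = []
    for _ in range(SEQ_LEN):
        logits = base_transformer(text_emb, pose)
        pose.append(int(logits.argmax()))
    # Stage 2: residual tokens, one quantization stage at a time.
    residuals = []
    for stage in range(N_RES_STAGES):
        logits = refine_transformer(text_emb, pose, stage)
        residuals.append(logits.argmax(axis=1))
    return pose, residuals
```

Splitting generation this way is what keeps editing interpretable: the pose tokens can be modified directly, and the refine pass re-predicts only the fine-grained residuals around them.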
Sukhyun Jeong
Fintech and AI Robotics (FAIR) Laboratory, the School of Robotics, Kwangwoon University, Nowon-gu, Seoul 01897, South Korea
Yong-Hoon Choi
Kwangwoon University
Machine Learning · Communications Networks