Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization

πŸ“… 2025-08-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing discrete pose encoding methods struggle to model fine-grained motion details, resulting in limited expressiveness and poor disentanglement. To address this, we propose a hierarchical pose representation framework that integrates discrete pose codes with continuous residual features via Residual Vector Quantization (RVQ), significantly enhancing motion detail fidelity while preserving interpretability and controllability. Our method employs an autoregressive decoder for end-to-end text-to-motion generation and is trained on the HumanML3D dataset. Experiments demonstrate substantial improvements: the FrΓ©chet Inception Distance (FID) drops from 0.041 to 0.015, and Top-1 R-Precision rises to 0.510. Qualitative evaluation confirms high precision, strong controllability, and semantic consistency in motion editing tasks. The core contribution lies in the first application of RVQ to text-driven motion generation, achieving an organic unification of discrete controllability and continuous expressiveness.

Technology Category

Application Category

πŸ“ Abstract
Recent progress in text-to-motion has advanced both 3D human motion generation and text-based motion control. Controllable motion generation (CoMo), which enables intuitive control, typically relies on pose code representations, but discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness. To overcome this, we propose a method that augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ). This design preserves the interpretability and manipulability of pose codes while effectively capturing subtle motion characteristics such as high-frequency details. Experiments on the HumanML3D dataset show that our model reduces Frechet inception distance (FID) from 0.041 to 0.015 and improves Top-1 R-Precision from 0.508 to 0.510. Qualitative analysis of pairwise direction similarity between pose codes further confirms the model's controllability for motion editing.
Problem

Research questions and friction points this paper is trying to address.

Enhancing expressiveness of discrete pose codes
Capturing fine-grained motion details effectively
Improving controllability for motion editing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual vector quantization for pose representation
Combining discrete codes with continuous motion features
Enhancing expressiveness while preserving interpretability
πŸ”Ž Similar Papers
No similar papers found.
S
Sukhyun Jeong
Division of Robotics, Kwangwoon University, Seoul, Korea
H
Hong-Gi Shin
Division of Robotics, Kwangwoon University, Seoul, Korea
Yong-Hoon Choi
Yong-Hoon Choi
Kwangwoon University
Machine LearningCommunications Networks