PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of cross-skeleton motion transfer and modeling temporal inconsistencies and point indistinguishability in temporal point clouds (TPCs), this paper proposes PUMPS—the first general-purpose motion pretraining framework for sequential point clouds. Methodologically, it introduces: (1) a learnable, skeleton-agnostic feature representation based on Gaussian-noise identifiers; (2) a lightweight linear assignment strategy for point correspondence, replacing computationally expensive point-wise attention mechanisms; and (3) a frame-level point cloud encoder–latent-space decoder architecture, jointly optimized via self-supervised pretraining and task-specific fine-tuning. Without requiring native labeled data, PUMPS achieves state-of-the-art performance directly after pretraining. After fine-tuning, it consistently outperforms dedicated methods on downstream tasks—including motion denoising and action estimation—demonstrating superior generalizability and structural universality across diverse skeletal configurations and motion dynamics.

📝 Abstract
Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and negate the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
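The linear assignment-based point pairing described in the abstract can be sketched as follows. This is an illustrative reconstruction loss, not the authors' code: decoded points are matched one-to-one to ground-truth points with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) on a pairwise distance matrix, and the loss averages the distances of the matched pairs. Function and variable names are hypothetical.

```python
# Illustrative sketch (not the authors' implementation) of linear
# assignment-based point pairing for TPC reconstruction.
import numpy as np
from scipy.optimize import linear_sum_assignment

def paired_reconstruction_loss(pred, target):
    """Match predicted points to target points one-to-one, then average
    the squared distances of the matched pairs.

    pred, target: (N, 3) arrays holding one frame's point cloud each.
    """
    # Pairwise squared Euclidean distances between every pred/target pair.
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    # Hungarian algorithm: optimal one-to-one pairing, no point-wise attention.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```

Because the matching is permutation-invariant, a decoder that emits points in an arbitrary order incurs no penalty, which is the property that lets PUMPS-style training avoid fixed point identities.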
Problem

Research questions and friction points this paper is trying to address.

Overcoming skeleton differences for cross-compatible motion synthesis
Enabling direct motion task learning with Temporal Point Clouds
Addressing temporal consistency in point-based motion representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoencoder for Temporal Point Clouds data
Linear assignment optimizes reconstruction
Latent features enable unsupervised pre-training
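The Gaussian-noise sampling identifier idea from the summary can be illustrated with a minimal sketch, assuming a tiny MLP decoder: each output point is decoded from the shared frame latent concatenated with its own i.i.d. Gaussian identifier, so an arbitrary number of distinct points can be sampled from one fixed-size latent vector. All layer sizes and weights here are hypothetical.

```python
# Hypothetical sketch of decoding points from a frame latent using
# Gaussian-noise sampling identifiers; dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
LATENT, NOISE, HIDDEN = 64, 16, 128
W1 = rng.standard_normal((LATENT + NOISE, HIDDEN)) * 0.05
W2 = rng.standard_normal((HIDDEN, 3)) * 0.05

def decode_points(frame_latent, n_points):
    """Decode n_points 3D points from a single frame latent vector."""
    ids = rng.standard_normal((n_points, NOISE))  # per-point Gaussian identifiers
    inp = np.concatenate([np.tile(frame_latent, (n_points, 1)), ids], axis=1)
    return np.maximum(inp @ W1, 0.0) @ W2  # tiny ReLU MLP -> (n_points, 3)
```

The identifiers make otherwise indistinguishable points separable at decode time, which is how a fixed-size frame feature can yield a variable-resolution point cloud.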
Clinton Ansun Mo
School of Computer Science, The University of Sydney, NSW 2006, Australia; The University of Tokyo, Bunkyo City, Tokyo, Japan
Kun Hu
School of Science, Edith Cowan University, WA 6027, Australia
Chengjiang Long
Research Engineer/Tech Leader at ByteDance Inc.
Computer Vision, Computer Graphics, Multimedia, Machine Learning, Artificial Intelligence
Dong Yuan
The University of Sydney
cloud and edge computing, AI, deep learning, internet of things, workflow
Wan-Chi Siu
Hong Kong Polytechnic University
swimming
Zhiyong Wang
School of Computer Science, The University of Sydney, NSW 2006, Australia