🤖 AI Summary
Addressing the challenges of cross-skeleton motion transfer, and of temporal inconsistency and point indistinguishability in temporal point clouds (TPCs), this paper proposes PUMPS, the first general-purpose motion pretraining framework for sequential point clouds. Methodologically, it introduces: (1) a learnable, skeleton-agnostic point representation that uses latent Gaussian-noise vectors as sampling identifiers; (2) a lightweight linear-assignment strategy for point correspondence, replacing computationally expensive point-wise attention mechanisms; and (3) a frame-level point cloud encoder and latent-space decoder, jointly optimised via self-supervised pretraining and task-specific fine-tuning. Without native labeled data, PUMPS matches state-of-the-art performance directly after pretraining; when fine-tuned, it outperforms many dedicated methods on downstream tasks, including motion denoising and motion estimation, demonstrating strong generalisability across diverse skeletal configurations and motion dynamics.
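The summary's point (1) — decoding distinct points from a single frame feature by conditioning on Gaussian-noise sampling identifiers — can be illustrated with a minimal sketch. The shapes, the random linear maps standing in for a trained decoder, and the function name `decode_point` are all illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one frame feature, n_points sampled points per frame.
feat_dim, noise_dim, n_points = 32, 8, 5
frame_feature = rng.normal(size=feat_dim)  # stands in for the encoder's output

# Random weights stand in for a trained decoder MLP (illustration only).
W_feat = rng.normal(size=(3, feat_dim)) * 0.1
W_noise = rng.normal(size=(3, noise_dim)) * 0.1

def decode_point(feature, noise_id):
    """Map (frame feature, Gaussian noise identifier) -> one 3D point."""
    return W_feat @ feature + W_noise @ noise_id

# Each distinct noise vector acts as the identifier for one decoded point,
# so one fixed frame feature yields a whole set of distinct points.
noise_ids = rng.normal(size=(n_points, noise_dim))
points = np.stack([decode_point(frame_feature, z) for z in noise_ids])
print(points.shape)
```

The key property the sketch demonstrates is that the decoder is queried per point: sampling more noise identifiers yields more points from the same frame feature, with no fixed point ordering.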
📝 Abstract
Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversibly convertible to and from skeletons, TPCs have mainly served as a compatibility format rather than a medium for learning motion tasks directly. Learning on TPCs would require data synthesis capabilities for the format, which raises unexplored challenges around temporal consistency and point identifiability. We therefore propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, avoiding expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of motion prediction, transition generation, and keyframe interpolation. On these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
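The linear assignment-based point pairing mentioned above can be sketched with a standard Hungarian matcher: decoded points carry no canonical ordering, so each predicted point is paired one-to-one with its cheapest ground-truth counterpart before a reconstruction loss is computed. This is a minimal sketch of the general technique using `scipy.optimize.linear_sum_assignment`; the toy data, cost choice (squared Euclidean distance), and loss are assumptions, not the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 6
target = rng.normal(size=(n, 3))  # one frame of ground-truth points

# Predicted points: a shuffled, slightly perturbed copy of the targets,
# mimicking an order-free decoder output.
pred = target[rng.permutation(n)] + 0.01 * rng.normal(size=(n, 3))

# Pairwise squared distances between predicted and target points.
cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)

# Hungarian matching yields a one-to-one pairing in O(n^3),
# sidestepping any point-wise attention over the two sets.
row, col = linear_sum_assignment(cost)
loss = cost[row, col].mean()
print(loss)
```

Because the perturbation is small, the optimal assignment recovers the shuffled correspondence and the loss reduces to the per-point noise, so the pairing step, not point identity, carries the supervision.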