TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing motion capture methods are limited by fixed skeletal templates or reliance on cumbersome manual rigging, hindering generalization to characters with arbitrary topologies. This work proposes TopoCap, a unified framework that, for the first time, enables motion extraction from monocular video and zero-shot retargeting to any unseen skeletal topology (including bipeds, hexapods, and even inanimate objects) without test-time optimization. The approach leverages a graph-conditioned variational autoencoder to learn a universal motion prior and combines structural embeddings with conditional flow matching to map visual inputs to topology-agnostic motion codes. TopoCap outperforms specialized models on both human and quadruped benchmarks and successfully animates long-tail 3D characters in a zero-shot setting. The study also introduces Mobjaverse, a large-scale dataset encompassing over 5,000 distinct topologies and 2 million motion frames.

📝 Abstract

The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, ranging from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupies a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that TopoCap outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. The dataset is publicly available at https://huggingface.co/datasets/duckduckplz/Mobjaverse.
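The second stage described in the abstract, predicting fixed-length motion codes with conditional flow matching, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the common straight-line probability path with a Gaussian source, and the names `cfm_training_pair`, `cfm_loss`, `sample`, and `velocity_fn` are hypothetical. In the paper's setting, `cond` would presumably carry visual features and `z1` the latent code produced by the Graph CVAE; here both are plain arrays.

```python
import numpy as np

def cfm_training_pair(z1, rng):
    """Build flow matching inputs and targets for a batch of latent codes z1.

    Assumes a straight-line path from Gaussian noise z0 to the data code z1.
    """
    z0 = rng.standard_normal(z1.shape)        # source noise sample
    t = rng.uniform(size=(z1.shape[0], 1))    # one time in [0, 1) per sample
    zt = (1.0 - t) * z0 + t * z1              # point on the straight path
    target = z1 - z0                          # constant velocity of that path
    return zt, t, target

def cfm_loss(velocity_fn, z1, cond, rng):
    """Mean squared error between predicted and target velocities.

    velocity_fn(zt, t, cond) is a hypothetical stand-in for a learned network.
    """
    zt, t, target = cfm_training_pair(z1, rng)
    pred = velocity_fn(zt, t, cond)
    return float(np.mean((pred - target) ** 2))

def sample(velocity_fn, cond, dim, rng, steps=50):
    """Generate codes by Euler-integrating the velocity field from t=0 to t=1."""
    z = rng.standard_normal((cond.shape[0], dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((cond.shape[0], 1), i * dt)
        z = z + dt * velocity_fn(z, t, cond)
    return z
```

As a sanity check under these assumptions, when each condition maps to exactly one target code `c`, the ideal velocity is `(c - z) / (1 - t)`. Plugging that in as `velocity_fn` drives the loss to zero and makes `sample` return `c`, which is a useful test before swapping in a trained network.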

Problem

Research questions and friction points this paper is trying to address.

motion retargeting

skeletal topology

monocular video

animation

3D characters

Innovation

Methods, ideas, or system contributions that make the work stand out.

topology-agnostic motion

universal motion manifold

graph CVAE