🤖 AI Summary
This work addresses category-level pose estimation for articulated objects from a single RGB-D image, without requiring dense supervision, CAD templates, or multi-view inputs. The authors propose a self-supervised method that uses an SE(3)-equivariant vector-neuron autoencoder to align observations into a canonical space, coupled with a joint-aware linear blend skinning module that jointly recovers shared category-level geometry, rigid part segmentation, and explicit joint parameters (rotation axes, pivot points, and articulation states). The approach is described as the first to achieve explicit joint modeling and geometry-motion disentanglement for category-level articulated objects without ground-truth labels. On both synthetic and real-world datasets, it outperforms existing self-supervised methods.
📝 Abstract
Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation, without ground-truth labels or category-specific models. SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module then models part motion. We learn this representation through two objectives: cycle reconstruction between observed and canonical shapes, and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that SCAPO recovers consistent part structure and accurate articulation parameters, and that it outperforms all self-supervised baselines.
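To make the idea of blend skinning with explicit joint parameters concrete, the following is a minimal NumPy sketch of generic linear blend skinning with revolute joints. It is not SCAPO's implementation, which the abstract does not specify. The function names (`rodrigues`, `blend_skinning`), the array layouts, and the restriction to revolute joints are all assumptions made for illustration. Each part rotates about its own axis through its own pivot, and every point is moved by the weighted blend of the per-part motions.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def blend_skinning(points, weights, axes, pivots, angles):
    """Articulate canonical points with per-part revolute joints (hypothetical layout).

    points:  (N, 3) canonical point cloud
    weights: (N, P) soft part assignment, each row summing to 1
    axes:    (P, 3) joint rotation axes
    pivots:  (P, 3) a point on each joint axis
    angles:  (P,)   articulation state of each joint, in radians
    """
    points = np.asarray(points, dtype=float)
    out = np.zeros_like(points)
    for k in range(weights.shape[1]):
        R = rodrigues(axes[k], angles[k])
        # Rotate about the pivot: shift to the pivot, rotate, shift back.
        moved = (points - pivots[k]) @ R.T + pivots[k]
        # Blend the per-part motion by the soft segmentation weight.
        out += weights[:, [k]] * moved
    return out

# Toy example: a static base (part 0) and a part rotated 90 degrees about
# the z-axis through the pivot (1, 0, 0) (part 1).
pts = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.0, 1.0]])
axes = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pivots = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
angles = np.array([0.0, np.pi / 2])
articulated = blend_skinning(pts, w, axes, pivots, angles)
```

In this toy example the first point stays at the origin, while the second point moves from (2, 0, 0) to (1, 1, 0). With hard one-hot weights the blend reduces to a rigid part segmentation. Soft weights interpolate between part motions near part boundaries.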