🤖 AI Summary
This work proposes a method for 6D pose tracking and 3D reconstruction of multiple rigid objects from monocular RGB-D video, without requiring CAD models or category-level priors. Initialized from sparse image points, the approach leverages a 2D point tracker to establish long-range correspondences, enabling causal, real-time, model-free multi-object pose tracking while incrementally building an online TSDF-based volumetric reconstruction of each object. Notably, it is the first model-free framework capable of immediately recovering multiple objects after complete occlusion. The method performs on par with state-of-the-art approaches under severe occlusion, while maintaining real-time operation, scaling to multiple objects, and jointly estimating object poses and reconstructing their geometry.
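To make the core idea concrete: given 2D correspondences from a point tracker and per-pixel depth, a rigid 6D pose can be recovered by back-projecting the tracked pixels to 3D and solving a least-squares rigid alignment. The sketch below uses the standard Kabsch/Umeyama SVD solution; it illustrates the general principle only, not the paper's actual implementation, and all function names and parameters here are illustrative assumptions.

```python
import numpy as np

def backproject(pts_2d, depth, K):
    """Back-project (N, 2) pixel coordinates with (N,) depth values to
    3D camera-frame points, using 3x3 pinhole intrinsics K."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth
    x = (pts_2d[:, 0] - cx) * z / fx
    y = (pts_2d[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) with dst ~ R @ src + t,
    via the Kabsch/Umeyama SVD method (a standard technique, not
    necessarily the solver used in the paper)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # guard against reflections
    t = mu_d - R @ mu_s
    return R, t
```

In a causal tracking loop, the same point identities maintained by the long-range 2D tracker would be back-projected in each new frame and aligned against their previous 3D positions to update the object pose.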
📝 Abstract
We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities not offered by previous model-free tracking approaches.
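The incremental TSDF reconstruction mentioned above can be illustrated with the classic weighted running-average fusion of Curless and Levoy, where each new depth frame updates a truncated signed-distance value per voxel. This is a minimal sketch of that standard scheme under assumed inputs (voxel centers in the object frame, a known object-to-camera pose), not the paper's reconstruction code.

```python
import numpy as np

def tsdf_update(tsdf, weights, voxel_pts, depth_img, K, T_cam_obj, trunc=0.04):
    """Fuse one depth frame into a flat TSDF volume.
    tsdf, weights: (N,) running SDF values and fusion weights.
    voxel_pts: (N, 3) voxel centers in the object frame.
    depth_img: (H, W) depth map; K: 3x3 intrinsics;
    T_cam_obj: 4x4 object-to-camera pose; trunc: truncation distance (m)."""
    # Transform voxel centers into the camera frame and project to pixels.
    pts_cam = voxel_pts @ T_cam_obj[:3, :3].T + T_cam_obj[:3, 3]
    z = pts_cam[:, 2]
    u = np.round(pts_cam[:, 0] * K[0, 0] / z + K[0, 2]).astype(int)
    v = np.round(pts_cam[:, 1] * K[1, 1] / z + K[1, 2]).astype(int)
    h, w = depth_img.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Signed distance along the ray: observed depth minus voxel depth.
    sdf = depth_img[v[valid], u[valid]] - z[valid]
    keep = sdf > -trunc          # skip voxels far behind the surface
    idx = np.flatnonzero(valid)[keep]
    phi = np.clip(sdf[keep] / trunc, -1.0, 1.0)
    # Weighted running average (Curless & Levoy style fusion).
    tsdf[idx] = (tsdf[idx] * weights[idx] + phi) / (weights[idx] + 1.0)
    weights[idx] += 1.0
    return tsdf, weights
```

Run over a sequence with the per-frame poses produced by the tracker, this accumulates a volumetric model from which a mesh can later be extracted (e.g. via marching cubes).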