🤖 AI Summary
Category-level 6D pose estimation methods typically rely on depth data, which limits their deployability in real-world RGB-only scenarios. To address this, the paper proposes an end-to-end, RGB-only approach built on three key components: (1) a geometry-aware Transformer network that explicitly encodes 3D structural priors of object categories; (2) a learnable geometric feature guidance mechanism that improves robustness to scale variation and occlusion; and (3) a lightweight RANSAC-PnP solver that recovers an accurate pose from the predicted geometric features. Evaluated on standard benchmarks including NOCS and Occlusion LINEMOD, the method outperforms existing RGB-only approaches in pose accuracy while remaining efficient at inference, achieving a superior accuracy-efficiency trade-off. These results empirically validate the feasibility and practicality of category-level 6D pose estimation without depth input.
📝 Abstract
While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they degrade significantly in scenes where depth information is unavailable. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images, enabling accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a Transformer-based neural network for category-level object pose estimation, in which the Transformer predicts and fuses the geometric features of the target object. To ensure that these predicted features faithfully capture the object's geometry, we introduce a geometric feature-guided algorithm that strengthens the network's ability to represent geometric information. Finally, we apply the RANSAC-PnP algorithm to compute the object's pose, which mitigates the difficulties caused by variable object scales. Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also more accurate than previous RGB-based methods. These promising results offer a new perspective on advancing category-level object pose estimation from RGB images alone.