🤖 AI Summary
This work addresses 6D object pose estimation under pose ambiguity and outlier sensitivity, particularly for symmetric objects and objects lacking distinctive local features. We propose an approach that, for the first time, integrates appearance-based semantic features into a conditional flow matching framework, formulating pose estimation as a conditional generative denoising process in ℝ³. By conditioning the denoising on both local geometric and semantic features, the method mitigates ambiguities induced by object symmetry. RANSAC-based registration is added to handle outliers. Evaluated on five datasets from the BOP benchmark, the approach improves Average Recall by 4.5 points on average over prior methods.
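For readers unfamiliar with conditional flow matching, one common textbook formulation is sketched below. It is a generic illustration, not necessarily the exact objective used by Flose: the straight-line probability path, the Gaussian source distribution, and the symbols $x_0$, $x_1$, $c$, and $v_\theta$ are assumptions introduced here. $x_1 \in \mathbb{R}^3$ denotes a target point, $x_0$ a noise sample, and $c$ the conditioning features (geometric and semantic).

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\bigl\| v_\theta(x_t, t \mid c) - (x_1 - x_0) \bigr\|^2
$$

Under this formulation, inference starts from a noise sample $x_0$ and integrates $\frac{dx}{dt} = v_\theta(x, t \mid c)$ from $t = 0$ to $t = 1$, which is the denoising process referred to above.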
📝 Abstract
Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. We introduce Flose, a generative method that infers object poses via a denoising process conditioned on local features. While prior approaches based on conditional flow matching denoise using geometric guidance alone, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries. We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark. Flose outperforms prior methods with an average improvement of +4.5 Average Recall. Project website: https://tev-fbk.github.io/Flose/
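The RANSAC-based registration step can be illustrated with a minimal sketch. It assumes the denoising stage yields 3D point correspondences between the object model (`src`) and the scene (`dst`), some of which are outliers; this is an assumption about the interface, not a detail given in the abstract. The function names `kabsch` and `ransac_rigid`, the iteration count, and the inlier threshold are hypothetical choices for illustration and are not taken from the Flose code base.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance between the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def ransac_rigid(src, dst, n_iters=200, inlier_thresh=0.01, seed=0):
    """Robustly fit a rigid transform to correspondences src[i] <-> dst[i].

    n_iters and inlier_thresh are illustrative defaults (assumptions),
    to be tuned to the scale and noise level of the data.
    """
    rng = np.random.default_rng(seed)
    n = len(src)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        # Three correspondences are the minimal sample for a rigid transform.
        idx = rng.choice(n, size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("RANSAC found no consensus set of at least 3 points")
    # Refit on the full consensus set for the final pose estimate.
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

Calling `ransac_rigid(src, dst)` on two `(N, 3)` arrays returns a rotation matrix, a translation vector, and a boolean inlier mask. Refitting on the consensus set, rather than keeping the minimal-sample estimate, is a common design choice because it averages out noise over all inliers.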