A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited robustness of six-degree-of-freedom (6-DoF) grasp pose estimation under the occlusions that arise in corner views. To overcome this limitation without relying on multi-view reconstruction, the authors propose a post-fusion approach that leverages an auxiliary viewpoint. Central to their method is a cross-view-aligned cylinder integration module that registers aligned cross-view points into a cylindrical coordinate frame and alternates local self-attention with seed cross-attention layers. The module is complemented by a self-supervised contrastive learning strategy that regularizes point cloud features, enhancing their spatial consistency and directional discriminability. The method achieves strong performance on the GraspNet-1Billion benchmark and in real-world scenarios.
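As a rough illustration of what registering points into a cylindrical coordinate frame can look like, the sketch below converts 3D points to (rho, phi, z) around a chosen axis. It is not taken from the authors' code: the function name `to_cylindrical`, the use of a seed point as the origin, and the choice of axis are all assumptions made for this example.

```python
import numpy as np


def to_cylindrical(points, origin, axis):
    """Express 3D points as (rho, phi, z) in a cylinder frame.

    `origin` and `axis` are hypothetical inputs for this sketch: a seed
    point and a direction standing in for the cylinder axis.
    """
    points = np.asarray(points, dtype=float)
    origin = np.asarray(origin, dtype=float)
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)

    # Pick a helper vector that is not parallel to the axis, then build
    # two unit vectors (u, v) spanning the plane orthogonal to the axis.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(axis @ helper) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, helper)
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)

    rel = points - origin
    z = rel @ axis            # height along the axis
    x = rel @ u               # in-plane coordinates
    y = rel @ v
    rho = np.hypot(x, y)      # radial distance from the axis
    phi = np.arctan2(y, x)    # angle around the axis
    return np.stack([rho, phi, z], axis=-1)
```

In this parameterization, rotating a point about the axis changes only phi and leaves rho and z unchanged, which is one way to see why a cylindrical frame highlights rotation-symmetric geometry.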

📝 Abstract

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
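The match and non-match rule described in the abstract can be sketched as a pairwise labeling step. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes both views are already expressed in a shared world frame, that each point carries a hypothetical per-point grasp direction vector, and that "same 3D location" and "distinct grasp directions" are decided by the made-up thresholds `dist_thresh` and `angle_thresh_deg`. Excluding matched pairs from the non-match set is also a choice made here for simplicity.

```python
import numpy as np


def label_cross_view_pairs(pts_a, pts_b, dirs_a, dirs_b,
                           dist_thresh=0.005, angle_thresh_deg=30.0):
    """Label every (view A point, view B point) pair.

    Returns two boolean arrays of shape (Na, Nb):
      match     - the two points lie at (nearly) the same 3D location
      non_match - the two points have clearly different directions
    Thresholds and direction inputs are assumptions for this sketch.
    """
    pts_a = np.asarray(pts_a, dtype=float)
    pts_b = np.asarray(pts_b, dtype=float)
    dirs_a = np.asarray(dirs_a, dtype=float)
    dirs_b = np.asarray(dirs_b, dtype=float)

    # Pairwise Euclidean distances between the two views.
    dist = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    match = dist < dist_thresh

    # Pairwise angles between normalized direction vectors.
    da = dirs_a / np.linalg.norm(dirs_a, axis=1, keepdims=True)
    db = dirs_b / np.linalg.norm(dirs_b, axis=1, keepdims=True)
    cos = np.clip(da @ db.T, -1.0, 1.0)
    angle = np.degrees(np.arccos(cos))

    # Assumption: a pair already labeled as a match is never a non-match.
    non_match = (angle > angle_thresh_deg) & ~match
    return match, non_match
```

The resulting masks could serve as positives and negatives in a contrastive loss over point features; the specific loss used in the paper is not stated in the abstract.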

Problem

Research questions and friction points this paper is trying to address.

6-DoF grasp pose estimation

occlusion

corner views

cross-view fusion

grasp robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view fusion

6-DoF grasp pose estimation

self-supervised contrastive learning