🤖 AI Summary
To address the accuracy and generalization bottlenecks in 6-DoF grasp pose estimation from a single RGB image, which stem from limited visual cues and the complexity of real-world objects, this paper proposes Triplane Grasping, a fast grasping decision-making method built on a hybrid Triplane-Gaussian 3D representation for real-time grasp inference. The method combines a triplane decoder and a point decoder with a point-cloud-driven grasp distribution mechanism to regress 6-DoF parallel-jaw grasp poses directly, anchoring candidate grasps at observed 3D points to improve geometric plausibility. Triplane Grasping enables zero-shot cross-object generalization and real-time inference on everyday objects, and experiments demonstrate a high grasp success rate. By unifying a compact geometric representation with end-to-end grasp pose generation, Triplane Grasping offers an efficient and robust paradigm for single-image-driven robotic grasping.
📝 Abstract
Reliable object grasping is one of the fundamental tasks in robotics. However, determining grasp poses from single-image input has long been a challenge due to limited visual information and the complexity of real-world objects. In this paper, we propose Triplane Grasping, a fast grasping decision-making method that relies solely on a single RGB image as input. Triplane Grasping builds a hybrid Triplane-Gaussian 3D representation through a point decoder and a triplane decoder, yielding an efficient, high-quality reconstruction of the target object that meets real-time grasping requirements. We propose an end-to-end network that generates 6-DoF parallel-jaw grasp distributions directly from 3D points in the point cloud, treating them as potential grasp contacts and anchoring the grasp poses in the observed data. Experiments demonstrate that our method achieves rapid modeling and grasp pose decision-making for everyday objects, and exhibits a high grasp success rate in zero-shot scenarios.
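To make the representation concrete: a triplane decoder factorizes a 3D feature volume into three axis-aligned 2D feature planes (XY, XZ, YZ); a 3D point is featurized by projecting it onto each plane, bilinearly sampling, and aggregating the results. The sketch below illustrates only this generic tri-plane query step, not the authors' network; the resolutions, channel counts, and summation-based aggregation are illustrative assumptions.

```python
import numpy as np

def make_triplanes(resolution=32, channels=8, seed=0):
    """Three axis-aligned feature planes (XY, XZ, YZ); here random, normally decoder outputs."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((resolution, resolution, channels)) for _ in range(3)]

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (R, R, C) feature plane at continuous coords u, v in [0, 1]."""
    r = plane.shape[0] - 1
    x, y = u * r, v * r
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, r), min(y0 + 1, r)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[x0, y0]
            + wx * (1 - wy) * plane[x1, y0]
            + (1 - wx) * wy * plane[x0, y1]
            + wx * wy * plane[x1, y1])

def query_triplanes(planes, point):
    """Project a 3D point (coords in [0, 1]^3) onto the three planes and sum the features."""
    x, y, z = point
    f_xy = bilinear_sample(planes[0], x, y)
    f_xz = bilinear_sample(planes[1], x, z)
    f_yz = bilinear_sample(planes[2], y, z)
    return f_xy + f_xz + f_yz  # a small decoder head would map this to geometry / grasp cues

planes = make_triplanes()
feat = query_triplanes(planes, (0.3, 0.7, 0.5))
print(feat.shape)  # (8,)
```

The appeal of this factorization is memory: three R x R planes scale quadratically with resolution, versus cubic growth for a dense voxel grid, which is what makes real-time reconstruction from a single image tractable.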