GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

📅 2026-06-06
🤖 AI Summary
This work addresses robotic grasping under partial observations, where existing methods struggle to combine local contact cues with complete object-level 3D geometry, which limits reliability. To overcome this, we propose GraspFoM, a unified framework that uses a 3D foundation prior (SAM3D) to build a shared 3D object latent serving both high-fidelity reconstruction and multimodal grasp pose prediction. Our approach introduces an anchor-initialized truncated pose-reasoning diffuser, a reconstruction-aware scorer, and a residual latent updater, so that reconstruction supplies geometric cues for grasping while grasp supervision refines the shared latent. Experiments show state-of-the-art results in both reconstruction quality and grasp success rate while adding only a small number of trainable parameters; ablation studies confirm the contribution of each component.
📝 Abstract
Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.
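The abstract does not spell out how the anchor-initialized truncated diffuser samples poses, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a standard DDIM-style deterministic reverse process, a grasp pose flattened to a plain vector (a real system would need proper rotation handling), and a hypothetical toy denoiser that stands in for the learned network conditioned on the shared object latent. The names `make_alpha_bars`, `make_toy_denoiser`, `truncated_ddim_sample`, and `start_step` are all illustrative.

```python
import numpy as np


def make_alpha_bars(num_steps: int) -> np.ndarray:
    """Cumulative signal-retention terms for an assumed linear beta schedule."""
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.cumprod(1.0 - betas)


def make_toy_denoiser(modes: np.ndarray, alpha_bars: np.ndarray):
    """Hypothetical stand-in for a learned noise predictor.

    It treats the nearest entry of `modes` as the clean pose and returns the
    noise that would explain the current sample under that assumption.
    """
    def predict_noise(x: np.ndarray, t: int) -> np.ndarray:
        a_t = alpha_bars[t]
        scaled = np.sqrt(a_t) * modes                      # (K, D)
        dists = np.linalg.norm(x[:, None, :] - scaled[None, :, :], axis=-1)
        nearest = modes[np.argmin(dists, axis=1)]          # (N, D)
        return (x - np.sqrt(a_t) * nearest) / np.sqrt(1.0 - a_t)

    return predict_noise


def truncated_ddim_sample(anchors, predict_noise, alpha_bars, start_step, rng):
    """Refine anchor poses using only the last `start_step` reverse steps.

    Instead of starting from pure Gaussian noise at the final timestep, each
    anchor is noised to an intermediate level and then denoised from there.
    """
    if not 1 <= start_step <= len(alpha_bars):
        raise ValueError("start_step must lie in [1, len(alpha_bars)]")

    # Forward-noise the anchors to the truncation level.
    a_start = alpha_bars[start_step - 1]
    noise = rng.standard_normal(anchors.shape)
    x = np.sqrt(a_start) * anchors + np.sqrt(1.0 - a_start) * noise

    # Deterministic DDIM-style reverse updates down to step 0.
    for t in range(start_step - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = predict_noise(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps
    return x


# Toy usage: two coarse anchors, each near a different grasp mode.
alpha_bars = make_alpha_bars(100)
grasp_modes = np.stack([np.zeros(7), np.ones(7)])   # made-up 7-D pose vectors
coarse_anchors = grasp_modes + 0.05                 # slightly off-mode guesses
refined = truncated_ddim_sample(
    coarse_anchors,
    make_toy_denoiser(grasp_modes, alpha_bars),
    alpha_bars,
    start_step=20,
    rng=np.random.default_rng(0),
)
```

In this sketch, truncation means only 20 of the 100 reverse steps are run, and because each anchor starts close to a distinct mode, the refined poses stay continuous and can cover several modes at once. How GraspFoM actually chooses anchors, the truncation level, and the pose parameterization is not stated in the abstract.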
Problem

Research questions and friction points this paper is trying to address.

robotic grasping
partial observations
3D reconstruction
object-level geometry
grasp affordances
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D foundation model
shared object latent
pose-reasoning diffuser
reconstruction-aware grasping
multimodal grasp prediction