HOGraspFlow: Exploring Vision-based Generative Grasp Synthesis with Hand-Object Priors and Taxonomy Awareness

📅 2025-09-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of generating multi-modal parallel-jaw grasp poses from a single RGB image without explicit object geometry priors. The proposed affordance-centric, vision-based generative method integrates hand-object contact reconstruction, a taxonomy-aware grasp-type prior, and foundation-model visual features, eliminating reliance on explicit 3D geometric input. It employs denoising flow matching to generate SE(3) grasp poses conditioned on RGB semantics, contact structure, and grasp type, preserving multimodality while improving geometric plausibility and physical feasibility. Evaluated in real-world settings, the method achieves an average grasp success rate of 83.2%, significantly outperforming diffusion-based variants and demonstrating strong cross-object generalization and robustness for practical deployment.
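To make the flow-matching component concrete, the sketch below shows a simplified conditional flow-matching training step for grasp-pose generation. It is an illustrative assumption, not the authors' implementation: the pose is flattened to a 9-D vector (3-D translation plus a 6-D rotation parameterization) rather than handled intrinsically on SE(3), and `cond` stands in for the fused RGB, contact, and grasp-type features described above.

```python
# Minimal sketch of conditional (rectified-flow style) flow matching for grasp poses.
# All network sizes and the 9-D pose parameterization are illustrative assumptions.
import torch
import torch.nn as nn

class GraspVelocityNet(nn.Module):
    """Predicts the flow velocity for a noisy pose, given time t and a conditioning vector."""
    def __init__(self, pose_dim=9, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, pose_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Regress the straight-line velocity x1 - x0 along a linear noise-to-data path."""
    x0 = torch.randn_like(x1)                        # sample from the Gaussian prior
    t = torch.rand(x1.shape[0], 1, device=x1.device) # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                    # point on the interpolation path
    target_v = x1 - x0                               # constant target velocity for this path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()
```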

📝 Abstract
We propose Hand-Object (HO)GraspFlow, an affordance-centric approach that retargets a single RGB image with hand-object interaction (HOI) into multi-modal executable parallel-jaw grasps without explicit geometric priors on target objects. Building on foundation models for hand reconstruction and vision, we synthesize SE(3) grasp poses with denoising flow matching (FM), conditioned on three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and a taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. A controlled comparison shows that HOGraspFlow consistently outperforms diffusion-based variants (HOGraspDiff), achieving higher distributional fidelity and more stable optimization in SE(3). We demonstrate reliable, object-agnostic grasp synthesis from human demonstrations in real-world experiments, with an average success rate of over 83%.
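The abstract contrasts flow matching with diffusion-based variants on sampling stability. As a rough illustration of that difference, the sketch below draws grasp candidates by deterministically integrating a learned velocity field with a few Euler steps; it reuses the illustrative GraspVelocityNet and 9-D pose parameterization assumed above and is not the paper's actual sampler.

```python
# Illustrative sampler: integrate the learned velocity field from noise (t=0) to data (t=1).
import torch

@torch.no_grad()
def sample_grasps(model, cond, num_samples=16, num_steps=20, pose_dim=9):
    """Deterministic Euler integration of the flow; returns a multi-modal set of pose candidates."""
    device = cond.device
    cond = cond.expand(num_samples, -1)                      # share one conditioning vector
    x = torch.randn(num_samples, pose_dim, device=device)    # start from the Gaussian prior
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((num_samples, 1), step * dt, device=device)
        x = x + dt * model(x, t, cond)                       # one Euler step along the flow
    return x  # candidate grasp poses (translation + 6-D rotation, per the assumed encoding)
```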
Problem

Research questions and friction points this paper is trying to address.

Retargeting a single RGB image into multi-modal parallel-jaw grasps
Synthesizing SE(3) grasp poses without explicit geometric object priors
Achieving object-agnostic grasp synthesis from human demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative grasp synthesis via denoising flow matching
Conditioned on RGB foundation features, HOI contact reconstruction, and a taxonomy-aware grasp-type prior (see the conditioning sketch after this list)
Object-agnostic method that outperforms diffusion-based grasp-synthesis variants
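The sketch below shows one plausible way the three conditioning cues listed above could be fused into a single context vector for the velocity network. The feature names, dimensions, and the 33-class grasp taxonomy size are placeholders, not the paper's architecture.

```python
# Hypothetical conditioning encoder: fuse RGB foundation features, a contact encoding,
# and a taxonomy-aware grasp-type embedding into one context vector.
import torch
import torch.nn as nn

class GraspCondEncoder(nn.Module):
    def __init__(self, rgb_dim=768, contact_dim=128, num_grasp_types=33, cond_dim=256):
        super().__init__()
        self.taxonomy_emb = nn.Embedding(num_grasp_types, 64)  # grasp-type prior as a learned embedding
        self.proj = nn.Sequential(
            nn.Linear(rgb_dim + contact_dim + 64, cond_dim), nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, rgb_feat, contact_feat, grasp_type_id):
        # rgb_feat: (B, rgb_dim), contact_feat: (B, contact_dim), grasp_type_id: (B,) long
        tax = self.taxonomy_emb(grasp_type_id)
        return self.proj(torch.cat([rgb_feat, contact_feat, tax], dim=-1))
```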