FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generalization in dexterous hand grasp synthesis, which arises from the high-dimensional joint space and hardware heterogeneity across different hand morphologies. To overcome this, the authors propose a cross-hand grasp generation method that requires no hardware-specific training data. By leveraging a pretrained diffusion model to extract fine-grained semantic graspability from human demonstration videos and integrating geometric cues from monocular depth images, the approach constructs a vision-language-driven grasp prior. A kinematics-aware retargeting module then maps this prior to diverse hand structures. The method consistently generates stable, functionally plausible multi-contact grasps on unseen objects, novel poses, and various hand types, significantly enhancing both generalization capability and hardware-agnostic performance.

📝 Abstract
Dexterous grasp synthesis remains a central challenge: the high dimensionality and kinematic diversity of multi-fingered hands prevent direct transfer of algorithms developed for parallel-jaw grippers. Existing approaches typically depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials, hindering scalability as new dexterous hand designs emerge. To this end, we propose a data-efficient framework, which is designed to bypass robot grasp data collection by exploiting the rich, object-centric semantic priors latent in pretrained generative diffusion models. Temporally aligned and fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. A kinematics-aware retargeting module then maps these affordance representations to diverse dexterous hands without per-hand retraining. The resulting system produces stable, functionally appropriate multi-contact grasps that remain reliably successful across common objects and tools, while exhibiting strong generalization across previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline leveraging vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
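The abstract describes a three-stage pipeline: extracting finger-specific affordances from human video via a pretrained diffusion prior, fusing them with depth-derived 3D geometry, and retargeting to a given hand's kinematics. The sketch below is a minimal illustration of that data flow only; every function, class, and parameter name here is hypothetical (the paper's actual API is not public in this listing), and the affordance and retargeting stages are stubs standing in for the generative model and IK solver.

```python
# Illustrative sketch, NOT the paper's implementation. All names are
# hypothetical. Stages mirror the abstract: (1) finger-specific affordance
# extraction, (2) back-projection with a depth image, (3) retargeting.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class AffordanceMap:
    # hypothetical: peak contact pixel (u, v) per finger
    contacts: Dict[str, Tuple[int, int]]

def extract_affordances(video_frames) -> AffordanceMap:
    """Stage 1 (stub): a pretrained diffusion model would localize
    finger-specific graspable regions in the human demonstration."""
    # placeholder peaks; a real system would query the generative prior
    return AffordanceMap(contacts={"thumb": (120, 80), "index": (132, 76)})

def lift_to_3d(aff: AffordanceMap, depth,
               fx=500.0, fy=500.0, cx=160.0, cy=120.0) -> Dict[str, Point3D]:
    """Stage 2: back-project each 2D contact peak using the standard
    pinhole model (the intrinsics here are made-up values)."""
    targets = {}
    for finger, (u, v) in aff.contacts.items():
        z = depth[v][u]
        targets[finger] = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    return targets

def retarget(contact_targets: Dict[str, Point3D],
             fingertip_names: List[str]) -> Dict[str, Point3D]:
    """Stage 3 (stub): a kinematics-aware retargeter would solve IK so the
    target hand's fingertips reach these points; here we only match
    fingers by name to show the interface."""
    return {f: contact_targets[f] for f in fingertip_names if f in contact_targets}

# usage: fake 320x240 depth image at a constant 0.5 m
depth = [[0.5] * 320 for _ in range(240)]
aff = extract_affordances(video_frames=None)
targets = lift_to_3d(aff, depth)
grasp = retarget(targets, ["thumb", "index"])
print(grasp["thumb"])  # back-projected thumb contact target in metres
```

The point of the sketch is the hardware-agnostic interface: only `retarget` sees the specific hand (its fingertip names), while the affordance and geometry stages are embodiment-independent, which is how the abstract frames cross-hand generalization without per-hand retraining.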
Problem

Research questions and friction points this paper is trying to address.

dexterous grasp synthesis
affordance grounding
cross-hand generalization
data-efficient manipulation
semantic priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

dexterous grasping
diffusion models
affordance grounding
cross-hand generalization
foundation models
Yifan Han
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Pengfei Yi
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Junyan Li
UMass Amherst
Foundation Models · Efficient AI
Hanqing Wang
HUST ➡ Shanghai AI lab ➡ HKUST(gz)
MLLM · Embodied AI · World Model · VLA
Gaojing Zhang
M.S. student
SLAM · Environment Awareness
Qi Peng Liu
School of Artificial Intelligence, Shanghai Jiao Tong University
Wenzhao Lian
Google X
Robotics · Machine Learning