FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generalization in dexterous hand grasp synthesis, which arises from the high-dimensional joint space and hardware heterogeneity across different hand morphologies. To overcome this, the authors propose a cross-hand grasp generation method that requires no hardware-specific training data. By leveraging a pretrained diffusion model to extract fine-grained semantic graspability from human demonstration videos and integrating geometric cues from monocular depth images, the approach constructs a vision-language-driven grasp prior. A kinematics-aware retargeting module then maps this prior to diverse hand structures. The method consistently generates stable, functionally plausible multi-contact grasps on unseen objects, novel poses, and various hand types, significantly enhancing both generalization capability and hardware-agnostic performance.

📝 Abstract
Dexterous grasp synthesis remains a central challenge: the high dimensionality and kinematic diversity of multi-fingered hands prevent direct transfer of algorithms developed for parallel-jaw grippers. Existing approaches typically depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials, hindering scalability as new dexterous hand designs emerge. To this end, we propose a data-efficient framework, which is designed to bypass robot grasp data collection by exploiting the rich, object-centric semantic priors latent in pretrained generative diffusion models. Temporally aligned and fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. A kinematics-aware retargeting module then maps these affordance representations to diverse dexterous hands without per-hand retraining. The resulting system produces stable, functionally appropriate multi-contact grasps that remain reliably successful across common objects and tools, while exhibiting strong generalization across previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline leveraging vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
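The abstract describes a three-stage pipeline: extracting finger-specific affordances from human video via a pretrained diffusion prior, fusing them with depth-derived 3D geometry, and retargeting to a given hand's kinematics. The sketch below is a minimal illustration of that data flow only; every function, class, and parameter name here is hypothetical (the paper's actual API is not public in this listing), and the affordance and retargeting stages are stubs standing in for the generative model and IK solver.

```python
# Illustrative sketch, NOT the paper's implementation. All names are
# hypothetical. Stages mirror the abstract: (1) finger-specific affordance
# extraction, (2) back-projection with a depth image, (3) retargeting.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class AffordanceMap:
    # hypothetical: peak contact pixel (u, v) per finger
    contacts: Dict[str, Tuple[int, int]]

def extract_affordances(video_frames) -> AffordanceMap:
    """Stage 1 (stub): a pretrained diffusion model would localize
    finger-specific graspable regions in the human demonstration."""
    # placeholder peaks; a real system would query the generative prior
    return AffordanceMap(contacts={"thumb": (120, 80), "index": (132, 76)})

def lift_to_3d(aff: AffordanceMap, depth,
               fx=500.0, fy=500.0, cx=160.0, cy=120.0) -> Dict[str, Point3D]:
    """Stage 2: back-project each 2D contact peak using the standard
    pinhole model (the intrinsics here are made-up values)."""
    targets = {}
    for finger, (u, v) in aff.contacts.items():
        z = depth[v][u]
        targets[finger] = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    return targets

def retarget(contact_targets: Dict[str, Point3D],
             fingertip_names: List[str]) -> Dict[str, Point3D]:
    """Stage 3 (stub): a kinematics-aware retargeter would solve IK so the
    target hand's fingertips reach these points; here we only match
    fingers by name to show the interface."""
    return {f: contact_targets[f] for f in fingertip_names if f in contact_targets}

# usage: fake 320x240 depth image at a constant 0.5 m
depth = [[0.5] * 320 for _ in range(240)]
aff = extract_affordances(video_frames=None)
targets = lift_to_3d(aff, depth)
grasp = retarget(targets, ["thumb", "index"])
print(grasp["thumb"])  # back-projected thumb contact target in metres
```

The point of the sketch is the hardware-agnostic interface: only `retarget` sees the specific hand (its fingertip names), while the affordance and geometry stages are embodiment-independent, which is how the abstract frames cross-hand generalization without per-hand retraining.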
Problem

Research questions and friction points this paper is trying to address.

dexterous grasp synthesis
affordance grounding
cross-hand generalization
data-efficient manipulation
semantic priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

dexterous grasping
diffusion models
affordance grounding
cross-hand generalization
foundation models
Yifan Han
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Pengfei Yi
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Junyan Li
UMass Amherst
Foundation Models · Efficient AI
Hanqing Wang
HUST ➡ Shanghai AI lab ➡ HKUST(gz)
MLLM · Embodied AI · World Model · VLA
Gaojing Zhang
M.S. student
SLAM · Environment Awareness
Qi Peng Liu
School of Artificial Intelligence, Shanghai Jiao Tong University
Wenzhao Lian
Google X
Robotics · Machine Learning