TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing template-based methods for 3D hand-object interaction reconstruction suffer from poor generalizability, high manual annotation cost, and weak robustness to occlusion. To address these limitations, this paper proposes the first template-free, text-guided generation and refinement framework. Methodologically, we introduce a novel text-instruction-driven shape prior generation mechanism that integrates CLIP’s text encoder with a diffusion model; further, we design a 2D-3D co-attention refinement module that jointly optimizes object pose and shape, enabling flexible integration of heterogeneous priors—including text-generated, manually modeled, and prototype-retrieved shapes. Quantitative evaluation on Dex-YCB and ObMan yields object Chamfer distances of 1.979 mm and 5.468 mm, respectively—outperforming all existing template-free approaches. Moreover, our method demonstrates strong occlusion robustness and favorable deployability in real-world scenarios.

📝 Abstract
Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual effort to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer object shape refinement and pose estimation. We adopt a two-stage pipeline: text-instructed prior generation followed by vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description, without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object interacting with the hand, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., 1.979 and 5.468 object Chamfer distance on the widely-used Dex-YCB and ObMan datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework is robust to occlusion while remaining compatible with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios.
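The object Chamfer distance reported above measures how far each reconstructed point is from the nearest ground-truth point, and vice versa. A minimal NumPy sketch of the symmetric metric (illustrative only; the paper does not specify whether it uses squared or unsquared distances, and function names here are our own):

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point clouds.

    pred: (N, 3) reconstructed object points
    gt:   (M, 3) ground-truth object points
    """
    # Pairwise Euclidean distances, shape (N, M)
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Average nearest-neighbor distance in both directions
    return dists.min(axis=1).mean() + dists.min(axis=0).mean()
```

For example, two identical clouds score 0, and a single point offset by 1 unit from its target scores 2 (1 in each direction).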
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of pre-defined 3D object templates in hand-object interaction reconstruction
Enhancing adaptability to unconstrained scenarios like occluded objects via text instructions
Bridging geometric gaps between synthesized prototypes and real objects through refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-instructed prior generation for 3D shapes
2D-3D collaborative attention for refinement
Template-free hand-object interaction framework
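The 2D-3D collaborative attention listed above lets features of the synthesized 3D prototype aggregate visual evidence from 2D image features. The paper does not publish the exact module, so the following is a hedged sketch of one plausible building block, a single cross-attention step in NumPy, where 3D point features act as queries over 2D patch features (all names and shapes are our assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_2d3d(point_feats, patch_feats):
    """One cross-attention step: 3D point features attend to 2D patches.

    point_feats: (N, d) per-point features of the 3D prototype (queries)
    patch_feats: (M, d) 2D image-patch features (keys and values)

    Returns (N, d): each 3D point's aggregation of 2D evidence,
    which a refinement head could use to update pose and shape.
    """
    d = point_feats.shape[-1]
    scores = point_feats @ patch_feats.T / np.sqrt(d)  # (N, M)
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    return attn @ patch_feats
```

A full module would likely add learned query/key/value projections and a symmetric 3D-to-2D path; this sketch only shows the attention mechanics.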