🤖 AI Summary
This paper addresses one-shot affordance grounding for deformable objects—objects with uncertain component properties, highly variable configurations, and strong visual interference—in egocentric robotic organizing scenes. The authors propose a vision-language grounding framework whose key contributions are: (1) a Deformable Object Semantic Enhancement Module (DefoSEM) that strengthens hierarchical understanding of internal structure and the identification of local features, even when component information is weak; (2) an ORB-Enhanced Keypoint Fusion Module (OEKFM) that leverages geometric constraints to optimize feature extraction of key components and improve robustness to shape diversity and visual interference; and (3) an instance-conditional prompting mechanism built from image data and task context that mitigates region ambiguity caused by prompt words. Evaluated on the newly constructed real-world dataset AGDDO15 (15 common deformable object types and their associated organizing actions), the method achieves state-of-the-art performance, outperforming prior approaches by 6.2%, 3.2%, and 2.9% in the KLD, SIM, and NSS metrics, respectively, while exhibiting strong generalization.
📝 Abstract
Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB-Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to shape diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in the KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. The source code and benchmark dataset will be made publicly available at https://github.com/Dikay1/OS-AGDO.
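KLD, SIM, and NSS are standard distribution-comparison metrics used in saliency and affordance grounding benchmarks. The sketch below gives their commonly used NumPy definitions (KL divergence of the prediction from the ground truth, histogram intersection, and Normalized Scanpath Saliency); this is a generic illustration of the metrics, not the paper's exact evaluation code.

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    # KL divergence between the ground-truth distribution and the
    # predicted map, each normalized to sum to 1 (lower is better).
    pred = pred / (pred.sum() + eps)
    gt = gt / (gt.sum() + eps)
    return float(np.sum(gt * np.log(eps + gt / (pred + eps))))

def sim(pred, gt, eps=1e-12):
    # Similarity (histogram intersection) between the two normalized
    # distributions: sum of element-wise minima (higher is better, max 1).
    pred = pred / (pred.sum() + eps)
    gt = gt / (gt.sum() + eps)
    return float(np.sum(np.minimum(pred, gt)))

def nss(pred, fixations, eps=1e-12):
    # Normalized Scanpath Saliency: z-score the predicted map, then
    # average it over binary ground-truth locations (higher is better).
    pred = (pred - pred.mean()) / (pred.std() + eps)
    return float(pred[fixations > 0].mean())
```

Under these definitions, a perfect prediction gives KLD ≈ 0 and SIM ≈ 1, while NSS grows with how sharply the prediction peaks at the annotated regions, which is why the paper reports improvements (lower KLD, higher SIM/NSS) as percentages per metric.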