🤖 AI Summary
Existing 3D point cloud affordance detection methods rely on coarse-grained, embedding-level cosine similarity, which fails to capture fine-grained semantic alignment between point clouds and textual affordance descriptions and thereby limits the accuracy of interactive region localization. To address this, we propose LM-AD, a language-model-guided affordance detection framework featuring an Affordance Query Module (AQM). AQM tightly integrates a pre-trained language model with a 3D point cloud encoder via cross-modal attention, enabling point-level alignment between individual points and words and explicitly modeling fine-grained associations between object surface geometry and functional verbs (e.g., “grasp”, “press”). Evaluated on the 3D AffordanceNet benchmark, LM-AD achieves significant improvements in detection accuracy and mean Intersection-over-Union (mIoU), outperforming state-of-the-art methods by an average of 4.2% mIoU. To our knowledge, it is the first method to enable precise, semantics-driven localization of functional regions in 3D point clouds.
📝 Abstract
In this work, we address the challenge of affordance detection in 3D point clouds, a task that requires effectively capturing fine-grained alignments between point clouds and text. Existing methods often struggle to model such alignments, resulting in limited performance on standard benchmarks. A key limitation of these approaches is their reliance on simple cosine similarity between point cloud and text embeddings, which lacks the expressiveness needed for fine-grained reasoning. To address this limitation, we propose LM-AD, a novel method for affordance detection in 3D point clouds. We also introduce the Affordance Query Module (AQM), which efficiently captures fine-grained alignment between point clouds and text by leveraging a pretrained language model. We demonstrate that our method outperforms existing approaches in terms of accuracy and mean Intersection over Union on the 3D AffordanceNet dataset.
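To make the contrast concrete, the following toy sketch compares embedding-level cosine similarity against a single pooled text vector with attention from each point over individual word tokens. It is an illustration under stated assumptions, not the authors' AQM: the function names, the hand-picked 3-dimensional features, and the two "word" vectors (standing in for verbs such as "grasp" and "press") are all hypothetical, and no learned projections or pretrained language model are involved.

```python
import math

def cosine_scores(point_feats, text_emb):
    """Embedding-level baseline: one cosine score per point against a single text vector."""
    t_norm = math.sqrt(sum(x * x for x in text_emb))
    scores = []
    for p in point_feats:
        p_norm = math.sqrt(sum(x * x for x in p))
        dot = sum(a * b for a, b in zip(p, text_emb))
        scores.append(dot / (p_norm * t_norm))
    return scores

def point_word_attention(point_feats, word_feats):
    """Scaled dot-product attention weights from each point over each word token."""
    d = len(word_feats[0])
    all_weights = []
    for p in point_feats:
        logits = [sum(a * b for a, b in zip(p, w)) / math.sqrt(d) for w in word_feats]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        all_weights.append([e / z for e in exps])
    return all_weights

# Hypothetical per-point features (3 points) and per-word features (2 words).
points = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.7, 0.7, 0.0]]
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Pooling the words into one vector discards which word a point matches.
pooled = [sum(col) / len(words) for col in zip(*words)]

scores = cosine_scores(points, pooled)
weights = point_word_attention(points, words)
```

In this toy setup the first two points receive the same cosine score against the pooled text vector even though each is closer to a different word, while the attention weights keep that distinction. This is the kind of point-to-word information that an embedding-level similarity cannot express.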