🤖 AI Summary
Ambiguous natural language instructions often obscure the intended robotic grasping actions, leading to insufficient precision in component-level manipulation. Method: We propose a language-driven, component-level grasping framework that integrates a fine-tuned large language model (LLM) with partial point cloud localization guided by 2D part segmentation. The LLM decodes implicit semantic intent from instructions, while 2D segmentation provides part-level priors that guide fine-grained spatial localization in point clouds; an environment-aware fusion algorithm then dynamically generates high-accuracy grasping poses. Results: Experiments demonstrate that our framework accurately identifies key operations and target components from ambiguous instructions in unstructured environments, significantly improving component-level grasp success rates and task adaptability. It establishes a novel paradigm for language–action co-reasoning in embodied intelligence.
📝 Abstract
Existing language-driven grasping methods struggle to handle ambiguous instructions containing implicit intents. To tackle this challenge, we propose LangGrasp, a novel language-interactive robotic grasping framework. The framework integrates fine-tuned large language models (LLMs), leveraging their robust commonsense understanding and environmental perception capabilities to deduce implicit intents from linguistic instructions and clarify task requirements along with target manipulation objects. Furthermore, our point cloud localization module, guided by 2D part segmentation, enables partial point cloud localization in scenes, thereby extending grasping from coarse-grained object-level to fine-grained part-level manipulation. Experimental results show that LangGrasp accurately resolves implicit intents in ambiguous instructions, identifying critical operations and target information that are unstated yet essential for task completion. Additionally, it dynamically selects optimal grasping poses by integrating environmental information. This enables high-precision grasping from object-level to part-level manipulation, significantly enhancing the adaptability and task execution efficiency of robots in unstructured environments. More information and code are available here: https://github.com/wu467/LangGrasp.
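To make the pipeline concrete, below is a minimal, self-contained sketch of the two core stages described above: intent resolution from an ambiguous instruction, and lifting a 2D part-segmentation mask into a partial point cloud. All function names, the keyword-based "LLM" stand-in, and the toy scene data are illustrative assumptions, not the authors' actual API or models.

```python
import numpy as np

def parse_intent(instruction: str) -> dict:
    """Stand-in for the fine-tuned LLM: map an ambiguous instruction to an
    explicit action, object, and target part (here, a toy keyword lookup)."""
    if "water" in instruction:
        return {"action": "grasp", "object": "cup", "part": "handle"}
    return {"action": "grasp", "object": "unknown", "part": "body"}

def localize_part_points(points: np.ndarray, pixels: np.ndarray,
                         part_mask: np.ndarray) -> np.ndarray:
    """Lift a 2D part-segmentation mask to a partial point cloud: keep only
    the 3D points whose projected pixel (u, v) falls inside the part mask."""
    keep = part_mask[pixels[:, 1], pixels[:, 0]]  # index mask as [v, u]
    return points[keep]

# Toy scene: 4 points with known pixel projections in a 2x2 image.
points = np.array([[0.1, 0.0, 0.5],
                   [0.2, 0.0, 0.5],
                   [0.3, 0.1, 0.6],
                   [0.4, 0.1, 0.6]])
pixels = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # (u, v) per point
mask = np.array([[True, False],
                 [True, False]])  # "handle" occupies the left image column

intent = parse_intent("I'd like some water")
part_points = localize_part_points(points, pixels, mask)
print(intent["part"], len(part_points))  # handle 2
```

In the real system, `parse_intent` would be the fine-tuned LLM and `part_mask` would come from a 2D part-segmentation model; the environment-aware fusion step that ranks grasp poses over `part_points` is omitted here.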