GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current language-driven grasp detection methods face two key bottlenecks: (1) difficulty in parsing complex, ambiguous natural-language instructions and (2) poor robustness in cluttered, dynamic environments—further hindered by reliance on domain-specific fine-tuning, which limits generalization. To address these, we propose a zero-shot multi-agent framework—Planner-Coder-Observer—that enables cross-domain language-driven grasping without any parameter updates or training. The Planner performs semantic parsing and high-level task decomposition; the Coder synthesizes executable robotic control code; and the Observer leverages real-time visual feedback for closed-loop execution refinement. By synergistically integrating large language models, program synthesis, and environment-aware perception, our approach significantly improves ambiguity resolution and adaptability to dynamic scenes. Evaluated on two large-scale benchmarks, it surpasses state-of-the-art methods and achieves >89% success rates on both simulation and real-world robotic platforms, demonstrating strong zero-shot generalization.
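The Planner-Coder-Observer loop described above can be sketched as a simple closed-loop program. This is a minimal illustration, not the paper's implementation: every class, method, and field name here (`Planner.plan`, `Coder.synthesize`, `Observer.evaluate`, the `scene` dict) is a hypothetical stand-in, and the LLM prompting and robot execution are stubbed out.

```python
from dataclasses import dataclass

# All names below are illustrative assumptions, not the paper's actual API.

@dataclass
class Feedback:
    success: bool
    note: str  # hint passed back to the Planner on failure

class Planner:
    """Decomposes a natural-language query into sub-steps (LLM call stubbed)."""
    def plan(self, query, note=""):
        steps = ["locate target object", "select grasp pose", "execute grasp"]
        if note:  # closed-loop refinement: fold Observer feedback into the plan
            steps.insert(0, f"revise plan given feedback: {note}")
        return steps

class Coder:
    """Turns a plan into executable code (returned here as a callable stub)."""
    def synthesize(self, steps):
        def program(scene):
            # Stand-in for executing generated robot code against the scene.
            return scene.get("target_visible", False)
        return program

class Observer:
    """Evaluates the execution outcome from (simulated) visual feedback."""
    def evaluate(self, succeeded):
        return Feedback(success=succeeded,
                        note="" if succeeded else "target occluded; re-plan")

def grasp_loop(query, scene, max_rounds=3):
    planner, coder, observer = Planner(), Coder(), Observer()
    note = ""
    for _ in range(max_rounds):
        steps = planner.plan(query, note)
        program = coder.synthesize(steps)
        feedback = observer.evaluate(program(scene))
        if feedback.success:
            return True
        note = feedback.note
        scene["target_visible"] = True  # stand-in for the environment changing
    return False
```

For example, `grasp_loop("pick up the red mug", {"target_visible": False})` fails its first attempt, feeds the Observer's note back to the Planner, and succeeds on the second round.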

📝 Abstract
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or finetuning step to adapt to new domains, limiting their generalization in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes and provides feedback. Intensive experiments on two large-scale datasets demonstrate that our GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real-world settings further validate the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Interpreting complex text instructions for robot grasping
Operating effectively in densely cluttered environments
Eliminating training needs for new domain adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system for grasp detection
Zero-shot language-driven approach
Specialized Planner, Coder, Observer agents
Quang Nguyen
FPT Software AI Center, Vietnam
Tri Le
FPT AI Center
Huy Nguyen
Automation & Control Institute (ACIN), TU Wien, Austria
Thieu Vo
Department of Mathematics, NUS, Singapore
Tung D. Ta
The University of Tokyo
Robotics · Human Computer Interaction · Digital Fabrication
Baoru Huang
University of Liverpool; Imperial College London
Robotics · Computer Vision · Surgical Vision · Image-Guided Intervention
Minh N. Vu
Automation & Control Institute (ACIN), TU Wien, Austria
Anh Nguyen
Department of Computer Science, University of Liverpool, UK