CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in tabletop object grasping—namely, the scarcity of multimodal data, spatial hallucinations caused by inaccurate geometric localization, and poor robustness in open-loop execution—by introducing an asynchronous closed-loop spatial perception framework. The proposed approach decouples semantic intent from geometric localization, compares pre- and post-execution states, and automatically generates multimodal training data. It integrates a vision-language model, a dual-pathway hierarchical perception module, and an asynchronous closed-loop evaluation mechanism to achieve high-precision, open-vocabulary grasping without requiring human teleoperation. Experimental results demonstrate that the system attains an overall success rate of 87.0% in complex, cluttered scenes, significantly outperforming existing methods and effectively narrowing the sim-to-real gap.

📝 Abstract
Robotic grasping of desktop objects is widely used in intelligent manufacturing, logistics, and agriculture. Although vision-language models (VLMs) show strong potential for robotic manipulation, their deployment in low-level grasping faces key challenges: scarce high-quality multimodal demonstrations, spatial hallucination caused by weak geometric grounding, and the fragility of open-loop execution in dynamic environments. To address these challenges, we propose Closed-Loop Asynchronous Spatial Perception (CLASP), a novel asynchronous closed-loop framework that integrates multimodal perception, logical reasoning, and state-reflective feedback. First, we design a Dual-Pathway Hierarchical Perception module that decouples high-level semantic intent from geometric grounding; this design constrains the reasoning model's output to well-defined action tuples, reducing spatial hallucinations. Second, an Asynchronous Closed-Loop Evaluator compares pre- and post-execution states and provides text-based diagnostic feedback, establishing a robust error-correction loop that mitigates the fragility of traditional open-loop execution in dynamic environments. Finally, we design a scalable multimodal data engine that automatically synthesizes high-quality spatial annotations and reasoning templates from real and synthetic scenes without human teleoperation. Extensive experiments demonstrate that our approach significantly outperforms existing baselines, achieving an 87.0% overall success rate. Notably, the proposed framework generalizes across diverse objects, bridging the sim-to-real gap and remaining robust on geometrically challenging categories and in cluttered scenes.
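The closed-loop evaluation idea in the abstract — compare the scene before and after execution, then feed a text diagnosis back into the next attempt — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `SceneState` representation, the 2 cm displacement threshold, and the diagnostic messages are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class SceneState:
    """Simplified scene snapshot: object name -> (x, y) tabletop position in meters."""
    objects: dict

def evaluate_execution(pre: SceneState, post: SceneState, target: str):
    """Compare pre- and post-execution states; return (success, diagnostic text).

    Success here means the target object left the workspace (grasped and
    lifted); any other outcome yields a text diagnosis for the next attempt.
    """
    if target not in pre.objects:
        return False, f"target '{target}' was never detected; re-run perception"
    if target not in post.objects:
        return True, f"'{target}' removed from workspace; grasp succeeded"
    dx = post.objects[target][0] - pre.objects[target][0]
    dy = post.objects[target][1] - pre.objects[target][1]
    if abs(dx) > 0.02 or abs(dy) > 0.02:  # moved more than 2 cm but not lifted
        return False, f"'{target}' was displaced but not grasped; re-localize and retry"
    return False, f"'{target}' did not move; gripper likely missed; adjust grasp pose"

def grasp_with_feedback(target, perceive, execute, max_attempts=3):
    """Closed-loop driver: the diagnostic text conditions each retry."""
    feedback = ""
    for _ in range(max_attempts):
        pre = perceive()
        execute(target, feedback)  # planner sees the previous failure diagnosis
        ok, feedback = evaluate_execution(pre, perceive(), target)
        if ok:
            return True, feedback
    return False, feedback
```

The key design point, as described in the abstract, is that the evaluator's output is text rather than a scalar reward, so a language-model planner can consume it directly when re-planning.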
Problem

Research questions and friction points this paper is trying to address.

robot grasping
vision-language models
spatial hallucination
open-loop execution
multimodal demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop perception
Asynchronous feedback
Geometric grounding
Vision-language models
Open-vocabulary grasping
Yiran Ling
Harbin Institute of Technology; National Key Laboratory of Smart Farm Technologies and Systems
Wenxuan Li
Johns Hopkins University
Imaging Informatics; Computer-aided Diagnosis
Siying Dong
Facebook
Databases; Distributed Systems
Yize Zhang
Harbin Institute of Technology; National Key Laboratory of Smart Farm Technologies and Systems
Xiaoyao Huang
China Telecom
Jing Jiang
Harbin Institute of Technology; National Key Laboratory of Smart Farm Technologies and Systems
Ruonan Li
Peng Cheng Laboratory
Jie Liu
Harbin Institute of Technology
Computer Science and Engineering