CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in tabletop object grasping—namely, the scarcity of multimodal data, spatial hallucinations caused by inaccurate geometric localization, and poor robustness in open-loop execution—by introducing an asynchronous closed-loop spatial perception framework. The proposed approach decouples semantic intent from geometric localization, compares pre- and post-execution states, and automatically generates multimodal training data. It integrates a vision-language model, a dual-pathway hierarchical perception module, and an asynchronous closed-loop evaluation mechanism to achieve high-precision, open-vocabulary grasping without requiring human teleoperation. Experimental results demonstrate that the system attains an overall success rate of 87.0% in complex, cluttered scenes, significantly outperforming existing methods and effectively narrowing the sim-to-real gap.

📝 Abstract
Robotic grasping of desktop objects is widely used in intelligent manufacturing, logistics, and agriculture. Although vision-language models (VLMs) show strong potential for robotic manipulation, their deployment in low-level grasping faces key challenges: scarce high-quality multimodal demonstrations, spatial hallucination caused by weak geometric grounding, and the fragility of open-loop execution in dynamic environments. To address these challenges, we propose Closed-Loop Asynchronous Spatial Perception (CLASP), a novel asynchronous closed-loop framework that integrates multimodal perception, logical reasoning, and state-reflective feedback. First, we design a Dual-Pathway Hierarchical Perception module that decouples high-level semantic intent from geometric grounding; this design constrains the reasoning model's output to well-defined action tuples, reducing spatial hallucinations. Second, an Asynchronous Closed-Loop Evaluator compares pre- and post-execution states and provides text-based diagnostic feedback, establishing a robust error-correction loop that mitigates the fragility of traditional open-loop execution in dynamic environments. Finally, we design a scalable multimodal data engine that automatically synthesizes high-quality spatial annotations and reasoning templates from real and synthetic scenes without human teleoperation. Extensive experiments demonstrate that our approach significantly outperforms existing baselines, achieving an 87.0% overall success rate. Notably, the proposed framework generalizes across diverse objects, bridging the sim-to-real gap and remaining robust on geometrically challenging categories and in cluttered scenes.
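The closed-loop evaluation idea in the abstract — compare the scene before and after execution, then feed a text diagnosis back into the next attempt — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `SceneState` representation, the 2 cm displacement threshold, and the diagnostic messages are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class SceneState:
    """Simplified scene snapshot: object name -> (x, y) tabletop position in meters."""
    objects: dict

def evaluate_execution(pre: SceneState, post: SceneState, target: str):
    """Compare pre- and post-execution states; return (success, diagnostic text).

    Success here means the target object left the workspace (grasped and
    lifted); any other outcome yields a text diagnosis for the next attempt.
    """
    if target not in pre.objects:
        return False, f"target '{target}' was never detected; re-run perception"
    if target not in post.objects:
        return True, f"'{target}' removed from workspace; grasp succeeded"
    dx = post.objects[target][0] - pre.objects[target][0]
    dy = post.objects[target][1] - pre.objects[target][1]
    if abs(dx) > 0.02 or abs(dy) > 0.02:  # moved more than 2 cm but not lifted
        return False, f"'{target}' was displaced but not grasped; re-localize and retry"
    return False, f"'{target}' did not move; gripper likely missed; adjust grasp pose"

def grasp_with_feedback(target, perceive, execute, max_attempts=3):
    """Closed-loop driver: the diagnostic text conditions each retry."""
    feedback = ""
    for _ in range(max_attempts):
        pre = perceive()
        execute(target, feedback)  # planner sees the previous failure diagnosis
        ok, feedback = evaluate_execution(pre, perceive(), target)
        if ok:
            return True, feedback
    return False, feedback
```

The key design point, as described in the abstract, is that the evaluator's output is text rather than a scalar reward, so a language-model planner can consume it directly when re-planning.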
Problem

Research questions and friction points this paper is trying to address.

robot grasping
vision-language models
spatial hallucination
open-loop execution
multimodal demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop perception
Asynchronous feedback
Geometric grounding
Vision-language models
Open-vocabulary grasping
Yiran Ling
Harbin Institute of Technology; National Key Laboratory of Smart Farm Technologies and Systems
Wenxuan Li
Johns Hopkins University
Imaging Informatics; Computer-aided Diagnosis
Siying Dong
Facebook
Databases; Distributed Systems
Yize Zhang
Harbin Institute of Technology; National Key Laboratory of Smart Farm Technologies and Systems
Xiaoyao Huang
China Telecom
Jing Jiang
Harbin Institute of Technology; National Key Laboratory of Smart Farm Technologies and Systems
Ruonan Li
Peng Cheng Laboratory
Jie Liu
Harbin Institute of Technology
Computer Science and Engineering