GraspView: Active Perception Scoring and Best-View Optimization for Robotic Grasping in Cluttered Environments

📅 2025-11-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address robotic grasping failures in cluttered scenes caused by occlusion, poor perceptual quality, and inconsistent 3D reconstruction, this paper proposes a monocular, RGB-only active perception grasping framework. Methodologically, it integrates single-view geometric reconstruction with multi-view fusion for global scene modeling; introduces a rendering-based grasp scoring mechanism to guide active viewpoint selection; and designs an online metric alignment module to ensure scale-consistent grasp planning. Its key contribution lies in eliminating reliance on depth sensors; instead, it establishes a robust closed-loop active perception pipeline combining VGGT-based pose estimation, GraspNet-based grasp proposal generation, and rendering-driven viewpoint optimization. Experiments demonstrate that the method significantly outperforms both RGB-D and single-view RGB baselines under challenging conditions, including severe occlusion, close-range manipulation, and grasping of transparent or specular objects.

๐Ÿ“ Abstract
Robotic grasping is a fundamental capability for autonomous manipulation, yet remains highly challenging in cluttered environments where occlusion, poor perception quality, and inconsistent 3D reconstructions often lead to unstable or failed grasps. Conventional pipelines have widely relied on RGB-D cameras to provide geometric information, which fail on transparent or glossy objects and degrade at close range. We present GraspView, an RGB-only robotic grasping pipeline that achieves accurate manipulation in cluttered environments without depth sensors. Our framework integrates three key components: (i) global perception scene reconstruction, which provides locally consistent, up-to-scale geometry from a single RGB view and fuses multi-view projections into a coherent global 3D scene; (ii) a render-and-score active perception strategy, which dynamically selects next-best-views to reveal occluded regions; and (iii) an online metric alignment module that calibrates VGGT predictions against robot kinematics to ensure physical scale consistency. Building on these tailor-designed modules, GraspView performs best-view global grasping, fusing multi-view reconstructions and leveraging GraspNet for robust execution. Experiments on diverse tabletop objects demonstrate that GraspView significantly outperforms both RGB-D and single-view RGB baselines, especially under heavy occlusion, near-field sensing, and with transparent objects. These results highlight GraspView as a practical and versatile alternative to RGB-D pipelines, enabling reliable grasping in unstructured real-world environments.
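The online metric alignment module described above calibrates up-to-scale VGGT predictions against metric robot kinematics. As an illustration of the underlying idea (not the paper's exact module), a global scale factor between predicted camera positions and kinematics-derived camera positions can be recovered in closed form by least squares; `align_metric_scale` and the synthetic data below are hypothetical names for this sketch.

```python
import numpy as np

def align_metric_scale(pred_positions, kin_positions):
    """Estimate a global metric scale factor aligning up-to-scale camera
    positions (e.g. from a VGGT-style reconstruction) with metric camera
    positions obtained from robot forward kinematics.

    Both inputs are (N, 3) arrays of corresponding camera centers.
    Illustrative least-squares sketch, not the paper's implementation.
    """
    pred = np.asarray(pred_positions, dtype=float)
    kin = np.asarray(kin_positions, dtype=float)
    # Center both trajectories so the estimate is translation-invariant.
    pred_c = pred - pred.mean(axis=0)
    kin_c = kin - kin.mean(axis=0)
    # Closed-form least squares: argmin_s || s * pred_c - kin_c ||^2
    return float(np.sum(pred_c * kin_c) / np.sum(pred_c * pred_c))

# Synthetic check: metric positions are 2.5x the up-to-scale ones, plus an offset.
rng = np.random.default_rng(0)
pred = rng.normal(size=(10, 3))
kin = 2.5 * pred + np.array([0.1, -0.3, 0.8])
print(align_metric_scale(pred, kin))  # → 2.5 (up to floating-point error)
```

Once the scale is known, all reconstructed geometry and grasp poses can be expressed in the robot's metric frame, which is what makes physically consistent grasp planning possible without a depth sensor.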
Problem

Research questions and friction points this paper is trying to address.

Addresses robotic grasping challenges in cluttered environments with occlusions
Overcomes limitations of RGB-D sensors on transparent and glossy objects
Eliminates dependency on depth sensors through RGB-only perception pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGB-only grasping pipeline without depth sensors
Active perception strategy for next-best-view selection
Online metric alignment for physical scale consistency
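The render-and-score next-best-view strategy listed above can be sketched as a generic loop: render the current scene model from each candidate viewpoint, score the rendering (e.g. by expected grasp quality or revealed occluded area), and move the camera to the highest-scoring view. `render_fn` and `score_fn` here are placeholders for the paper's renderer and grasp-scoring mechanism, assumed for illustration only.

```python
def select_next_best_view(candidate_views, render_fn, score_fn):
    """Return the candidate view with the highest rendered score.

    candidate_views: iterable of viewpoint descriptors (poses, angles, ...)
    render_fn: viewpoint -> rendered image of the current scene model
    score_fn: (image, viewpoint) -> scalar score (e.g. grasp confidence)
    """
    best_view, best_score = None, float("-inf")
    for view in candidate_views:
        image = render_fn(view)        # synthesize a view of the scene model
        score = score_fn(image, view)  # higher = more informative viewpoint
        if score > best_score:
            best_view, best_score = view, score
    return best_view, best_score

# Toy usage with stand-in functions: views are scalar angles, and the
# score peaks at view 1.0, so that view is selected.
views = [0.0, 0.5, 1.0, 1.5]
best, score = select_next_best_view(
    views,
    render_fn=lambda v: v,
    score_fn=lambda img, v: -(v - 1.0) ** 2,
)
print(best)  # → 1.0
```

In the actual pipeline, the selected view would be visited by the robot, the new RGB observation fused into the global reconstruction, and the loop repeated until grasp proposals score highly enough to execute.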
Sheng Wang
Peng Cheng Laboratory, Shenzhen, China
Mingtong Dai
Peng Cheng Laboratory, Shenzhen, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; University of Chinese Academy of Sciences, Beijing, China
Jingxuan Su
School of Electronic and Computer Engineering, Peking University, Shenzhen, China
Visual Perception · Multi-Modal Learning · Quality Assessment
Lingbo Liu
Peng Cheng Laboratory, Shenzhen, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; X-Era AI Lab, China
Chunjie Chen
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Xinyu Wu
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI · Causal Inference and Learning · Multimodal Data Analysis