A General One-Shot Multimodal Active Perception Framework for Robotic Manipulation: Learning to Predict Optimal Viewpoint

📅 2026-01-20
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
This work proposes a general-purpose, one-shot multimodal active perception framework that overcomes the limitations of existing approaches, which rely on task-specific iterative optimization and suffer from high computational cost and poor generalization. The key innovation lies in decoupling viewpoint quality assessment from the overall architecture for the first time, enabling a cross-attention-based multimodal network to directly predict optimal camera viewpoints. Trained on a large-scale dataset constructed through systematic viewpoint sampling and domain randomization, the method supports cross-task transfer and zero-shot simulation-to-reality (sim-to-real) deployment. Evaluated on a constrained-view grasping task, it nearly doubles real-world grasping success rates and achieves seamless sim-to-real transfer without fine-tuning.
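
The paper itself publishes no code, but the core idea described above — a cross-attention network that fuses RGB and depth features and regresses a camera pose adjustment in a single forward pass — can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the patch encoders, token dimension, attention direction, and the 6-DoF pose parameterization are all guesses.

```python
import torch
import torch.nn as nn

class ViewpointPredictor(nn.Module):
    """Sketch of a one-shot multimodal viewpoint predictor: cross-attention
    aligns RGB and depth tokens, then a small head regresses the pose delta."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Placeholder patch encoders; the paper's actual backbones are not specified here.
        self.rgb_encoder = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.Flatten(2))
        self.depth_encoder = nn.Sequential(nn.Conv2d(1, dim, 16, stride=16), nn.Flatten(2))
        # Cross-attention: RGB tokens act as queries over depth tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Regress a 6-DoF camera pose adjustment (translation + axis-angle rotation).
        self.pose_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 6))

    def forward(self, rgb, depth):
        q = self.rgb_encoder(rgb).transpose(1, 2)       # (B, N, dim) RGB tokens
        kv = self.depth_encoder(depth).transpose(1, 2)  # (B, M, dim) depth tokens
        fused, _ = self.cross_attn(q, kv, kv)           # align and fuse the two modalities
        return self.pose_head(fused.mean(dim=1))        # (B, 6) camera pose adjustment

net = ViewpointPredictor()
delta = net(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))  # one shot, no iteration
```

Because the pose adjustment comes from a single forward pass, there is no per-scene iterative optimization loop, which is where the claimed savings in time and camera motion come from.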

📝 Abstract
Active perception in vision-based robotic manipulation aims to move the camera toward more informative observation viewpoints, thereby providing high-quality perceptual inputs for downstream tasks. Most existing active perception methods rely on iterative optimization, leading to high time and motion costs, and are tightly coupled with task-specific objectives, which limits their transferability. In this paper, we propose a general one-shot multimodal active perception framework for robotic manipulation. The framework enables direct inference of optimal viewpoints and comprises a data collection pipeline and an optimal viewpoint prediction network. Specifically, the framework decouples viewpoint quality evaluation from the overall architecture, supporting heterogeneous task requirements. Optimal viewpoints are defined through systematic sampling and evaluation of candidate viewpoints, after which large-scale training datasets are constructed via domain randomization. Moreover, a multimodal optimal viewpoint prediction network is developed, leveraging cross-attention to align and fuse multimodal features and directly predict camera pose adjustments. The proposed framework is instantiated in robotic grasping under viewpoint-constrained environments. Experimental results demonstrate that active perception guided by the framework significantly improves grasp success rates. Notably, real-world evaluations achieve nearly double the grasp success rate and enable seamless sim-to-real transfer without additional fine-tuning, demonstrating the effectiveness of the proposed framework.
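
For intuition, the data-collection pipeline the abstract describes — systematic sampling of candidate viewpoints followed by task-specific scoring to label the best one — might look roughly like the sketch below. Every name here is a hypothetical stand-in: the hemisphere grid, the angle ranges, and `score_fn` are assumptions, since the abstract does not specify the sampling scheme or the quality metric.

```python
import numpy as np

def sample_candidate_viewpoints(center, radius, n_azim=12, n_elev=5):
    """Systematic grid of candidate camera positions on a hemisphere around
    the workspace center (a common parameterization; assumed, not the
    paper's actual scheme)."""
    azim, elev = np.meshgrid(
        np.linspace(0.0, 2.0 * np.pi, n_azim, endpoint=False),
        np.linspace(np.deg2rad(15), np.deg2rad(75), n_elev))
    azim, elev = azim.ravel(), elev.ravel()
    offsets = radius * np.stack([np.cos(elev) * np.cos(azim),
                                 np.cos(elev) * np.sin(azim),
                                 np.sin(elev)], axis=1)
    return np.asarray(center) + offsets  # (n_azim * n_elev, 3) camera positions

def label_optimal_viewpoint(candidates, score_fn):
    """Viewpoint-quality evaluation is decoupled from the architecture:
    score_fn is any task-specific metric (e.g. grasp detection confidence
    on a simulator render), so the same labeling step serves heterogeneous
    tasks. Returns the best candidate as the training label."""
    scores = np.array([score_fn(v) for v in candidates])
    return candidates[int(np.argmax(scores))], scores
```

Repeating this labeling over many domain-randomized simulated scenes yields (observation, optimal viewpoint) pairs on which the prediction network can then be trained.
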
Problem

Research questions and friction points this paper is trying to address.

active perception
robotic manipulation
one-shot
viewpoint optimization
multimodal
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-shot active perception
multimodal fusion
cross-attention
sim-to-real transfer
viewpoint prediction
Deyun Qin
Institute of Robotics and Automatic Information Systems, College of Artificial Intelligence, Nankai University, Tianjin 300350, China; and Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin 300350, China
Zezhi Liu
Institute of Robotics and Automatic Information Systems, College of Artificial Intelligence, Nankai University, Tianjin 300350, China; and Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin 300350, China
Hanqian Luo
College of Artificial Intelligence, Nankai University, Tianjin 300350, China; and Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Xiao Liang
Institute of Robotics and Automatic Information Systems, College of Artificial Intelligence, Nankai University, Tianjin 300350, China; and Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin 300350, China
Yongchun Fang
Nankai University
Visual Servoing · Nonlinear Control · Atomic Force Microscope