ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual reinforcement learning policies for robotic manipulation are highly sensitive to camera viewpoint variations in real-world deployment, often failing outright when the camera moves. Existing approaches either rely on precise camera calibration or fail to generalize under large viewpoint shifts. This paper proposes ManiVID-3D, a view-invariant 3D representation learning framework: a lightweight ViewNet module performs calibration-free spatial alignment of point cloud observations, and self-supervised disentangled feature learning is combined with a GPU-accelerated batch rendering module to yield an end-to-end trainable model. Evaluated on ten simulated and five real-world robotic tasks, the method achieves a 44.7% higher success rate than state-of-the-art approaches under viewpoint variations, uses 80% fewer parameters, and substantially improves viewpoint robustness and sim-to-real transfer.

📝 Abstract
Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted, an unavoidable situation in real-world settings where sensor placement is hard to control precisely. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 44.7% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments. Our project website can be found at https://zheng-joe-lee.github.io/manivid3d/.
Problem

Research questions and friction points this paper is trying to address.

Achieving view-invariant robotic manipulation under camera viewpoint changes
Eliminating dependency on precise camera calibration for visual RL
Learning geometrically consistent 3D representations for sim-to-real transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

ViewNet module aligns point clouds without calibration
GPU-accelerated rendering processes 5000 frames per second
Disentangled 3D representations enable view-invariant manipulation learning
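The paper does not publish ViewNet's internals here, but the core idea it describes is applying a predicted rigid transform to map a point cloud observed from an arbitrary camera into a shared canonical frame, with no extrinsic calibration. A minimal sketch of that alignment step, assuming the module outputs a unit quaternion and a translation (the function names and the quaternion/translation parameterization are illustrative assumptions, not the authors' code):

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def align_points(points, pred_quat, pred_trans):
    """Map an (N, 3) point cloud from the camera frame into a canonical
    workspace frame using a predicted rotation and translation."""
    R = quat_to_rotmat(np.asarray(pred_quat, dtype=float))
    return points @ R.T + np.asarray(pred_trans, dtype=float)

# Toy check: a cloud seen from a camera rotated +90 degrees about z is
# mapped back so that differently-placed cameras agree in one frame.
pts_cam = np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.5]])
q_90z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
aligned = align_points(pts_cam, q_90z, np.zeros(3))
```

Once all viewpoints land in one coordinate system like this, the downstream policy only ever sees geometrically consistent inputs, which is what makes the learned representation view-invariant.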
Zheng Li
The Hong Kong University of Science and Technology (Guangzhou)
Pei Qu
The Hong Kong University of Science and Technology (Guangzhou)
Yufei Jia
Tsinghua University
Shihui Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Haizhou Ge
Tsinghua University
Jiahang Cao
The University of Hong Kong
Jinni Zhou
HKUST(GZ), HKUST
Guyue Zhou
Tsinghua University
Jun Ma
The Hong Kong University of Science and Technology