DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic manipulation policy learning faces two challenges: insufficient joint modeling of 3D geometry, dynamics, and task semantics, and scarce real-world training data. To address both, this paper proposes a 3D-perceptive dynamics-representation pretraining framework that leverages multi-view RGB-D videos. The method introduces masked future volume rendering, a novel representation-learning technique that unifies spatial structure, semantic intent, and physical dynamics. It integrates differentiable volume rendering, triplane representations, masked reconstruction, and future-frame prediction, and enables policy transfer via action-value maps. Evaluated on RLBench, Colosseum, and real robotic platforms, the approach significantly improves manipulation success rates, robustness to environmental perturbations, and cross-task generalization.
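The summary names two standard building blocks that can be pictured concretely. As a rough illustration only (not the authors' implementation, and with nearest-neighbor lookup standing in for the bilinear sampling real triplane models use), a triplane query projects a 3D point onto three axis-aligned feature planes and sums the sampled features, and differentiable volume rendering alpha-composites densities and colors along a camera ray:

```python
import numpy as np

def triplane_query(planes, p):
    """Sum per-plane features for a point p in [0, 1)^3.
    `planes` maps "xy"/"xz"/"yz" to (R, R, C) feature arrays.
    Nearest-neighbor lookup keeps this sketch short."""
    r = planes["xy"].shape[0]
    x, y, z = (int(c * r) for c in p)
    return planes["xy"][x, y] + planes["xz"][x, z] + planes["yz"][y, z]

def volume_render(sigmas, colors, deltas):
    """Standard volumetric rendering along one ray: per-sample
    densities `sigmas` (N,), colors (N, 3), step sizes `deltas` (N,)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                          # compositing weights
    return weights @ colors                           # rendered RGB
```

In the pretraining setup described above, features queried from (masked) triplanes would be decoded into densities and colors and rendered this way, so a photometric loss against the observed RGB-D views can supervise the representation end to end.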

📝 Abstract
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
Problem

Research questions and friction points this paper is trying to address.

Learning generalizable robotic manipulation policies with limited real-world data
Jointly capturing 3D geometry, semantics, and dynamics for manipulation
Transferring learned representations to downstream robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns 3D-aware dynamics via masked future rendering
Uses differentiable volumetric rendering for triplane features
Transfers representations via action value map prediction
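The last point, transfer via action-value map prediction, can be pictured as scoring each voxel of the workspace and taking the best-scoring cell as the next gripper target. A minimal sketch, in which the voxel grid and workspace bounds are invented for illustration:

```python
import numpy as np

def select_action(value_map, workspace_min, workspace_max):
    """Pick a gripper target from a voxelized action-value map.
    `value_map` has shape (D, D, D); bounds are 3-vectors giving the
    workspace extent. Returns the argmax cell's center in world coords."""
    idx = np.unravel_index(np.argmax(value_map), value_map.shape)
    frac = (np.array(idx) + 0.5) / np.array(value_map.shape)  # cell center
    return workspace_min + frac * (workspace_max - workspace_min)
```

A policy head trained on top of the pretrained features would predict such a map per timestep; the spatial argmax then converts the dense prediction into a concrete end-effector goal.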
Jingyi Tian
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Computer Vision, Machine Learning
Sen Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jiayi Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Gang Hua
Director of Applied Science, AI, Amazon.com, Inc., IEEE & IAPR Fellow
Computer Vision, Machine Learning, Artificial Intelligence, Robotics, Multimedia