DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic manipulation policy learning faces two challenges: insufficient joint modeling of 3D geometry, dynamics, and task semantics, and scarce real-world training data. To address both, this paper proposes a 3D-perceptive dynamics-representation pretraining framework that leverages multi-view RGB-D videos. The method introduces masked future volume rendering, a novel representation-learning technique that unifies spatial structure, semantic intent, and physical dynamics. It integrates differentiable volume rendering, triplane representations, masked reconstruction, and future-frame prediction, and enables policy transfer via action-value maps. Evaluated on RLBench, Colosseum, and real robotic platforms, the approach significantly improves manipulation success rates, robustness to environmental perturbations, and cross-task generalization.
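The summary names two standard building blocks that can be pictured concretely. As a rough illustration only (not the authors' implementation, and with nearest-neighbor lookup standing in for the bilinear sampling real triplane models use), a triplane query projects a 3D point onto three axis-aligned feature planes and sums the sampled features, and differentiable volume rendering alpha-composites densities and colors along a camera ray:

```python
import numpy as np

def triplane_query(planes, p):
    """Sum per-plane features for a point p in [0, 1)^3.
    `planes` maps "xy"/"xz"/"yz" to (R, R, C) feature arrays.
    Nearest-neighbor lookup keeps this sketch short."""
    r = planes["xy"].shape[0]
    x, y, z = (int(c * r) for c in p)
    return planes["xy"][x, y] + planes["xz"][x, z] + planes["yz"][y, z]

def volume_render(sigmas, colors, deltas):
    """Standard volumetric rendering along one ray: per-sample
    densities `sigmas` (N,), colors (N, 3), step sizes `deltas` (N,)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                          # compositing weights
    return weights @ colors                           # rendered RGB
```

In the pretraining setup described above, features queried from (masked) triplanes would be decoded into densities and colors and rendered this way, so a photometric loss against the observed RGB-D views can supervise the representation end to end.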

📝 Abstract
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
Problem

Research questions and friction points this paper is trying to address.

Learning generalizable robotic manipulation policies with limited real-world data
Jointly capturing 3D geometry, semantics, and dynamics for manipulation
Transferring learned representations to downstream robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns 3D-aware dynamics via masked future rendering
Uses differentiable volumetric rendering for triplane features
Transfers representations via action value map prediction
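The last point, transfer via action-value map prediction, can be pictured as scoring each voxel of the workspace and taking the best-scoring cell as the next gripper target. A minimal sketch, in which the voxel grid and workspace bounds are invented for illustration:

```python
import numpy as np

def select_action(value_map, workspace_min, workspace_max):
    """Pick a gripper target from a voxelized action-value map.
    `value_map` has shape (D, D, D); bounds are 3-vectors giving the
    workspace extent. Returns the argmax cell's center in world coords."""
    idx = np.unravel_index(np.argmax(value_map), value_map.shape)
    frac = (np.array(idx) + 0.5) / np.array(value_map.shape)  # cell center
    return workspace_min + frac * (workspace_max - workspace_min)
```

A policy head trained on top of the pretrained features would predict such a map per timestep; the spatial argmax then converts the dense prediction into a concrete end-effector goal.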
Jingyi Tian
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Computer Vision, Machine Learning
Sen Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jiayi Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Gang Hua
Director of Applied Science, AI, Amazon.com, Inc., IEEE & IAPR Fellow
Computer Vision, Machine Learning, Artificial Intelligence, Robotics, Multimedia