ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reconstructing interactive dynamic 3D scenes from monocular egocentric robot videos remains challenging, particularly when the videos involve complex contacts and abrupt pose changes. This work proposes a unified framework, built on decoupled 3D Gaussian splatting, that models the robot, objects, and background as independent subfields. It pairs a graph-structured disentangled representation with a task-oriented spatio-temporal alignment mechanism. By combining segmentation of each manipulation task into alternating Motion and Skill phases with joint photometric-geometric optimization, the method produces high-fidelity, controllable, simulation-ready dynamic digital twins. Evaluated on real-world data, the approach reconstructs interaction-driven dynamic scenes robustly and supports downstream robotic task planning and policy learning.

📝 Abstract

Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks, which alternate between Motion and Skill phases, to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.
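To make the idea of independently optimizable subfields organized in a scene graph more concrete, the following is a minimal sketch and not the authors' implementation. Every name in it (`GaussianSubfield`, `SceneGraph`, `world_means`) is hypothetical. It assumes each node stores only Gaussian centers and a translation relative to its parent; rotations, covariances, opacities, and colors are omitted for brevity. Attaching the object to the robot is likewise an assumption, meant to illustrate how a grasped object could follow the robot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class GaussianSubfield:
    """Hypothetical scene-graph node: a set of Gaussians with its own pose.

    Only the Gaussian centers are stored; a full 3DGS model would also keep
    covariance, opacity, and color for every Gaussian.
    """
    name: str
    means: List[Vec3]                     # centers in the node's local frame
    translation: Vec3 = (0.0, 0.0, 0.0)   # offset relative to the parent node
    parent: Optional[str] = None          # None marks a root node


@dataclass
class SceneGraph:
    """Hypothetical container that keeps the subfields separate by name."""
    nodes: Dict[str, GaussianSubfield] = field(default_factory=dict)

    def add(self, node: GaussianSubfield) -> None:
        self.nodes[node.name] = node

    def world_offset(self, name: str) -> Vec3:
        """Sum the translations from a node up to its root."""
        x = y = z = 0.0
        current: Optional[str] = name
        while current is not None:
            node = self.nodes[current]
            x += node.translation[0]
            y += node.translation[1]
            z += node.translation[2]
            current = node.parent
        return (x, y, z)

    def world_means(self, name: str) -> List[Vec3]:
        """Gaussian centers of one subfield expressed in the world frame."""
        ox, oy, oz = self.world_offset(name)
        return [(mx + ox, my + oy, mz + oz)
                for mx, my, mz in self.nodes[name].means]


graph = SceneGraph()
graph.add(GaussianSubfield("background", [(0.0, 0.0, 0.0)]))
graph.add(GaussianSubfield("robot", [(1.0, 0.0, 0.0)],
                           translation=(0.0, 2.0, 0.0)))
# Assumption: a grasped object is parented to the robot so it moves with it.
graph.add(GaussianSubfield("object", [(0.0, 0.0, 1.0)],
                           translation=(0.5, 0.0, 0.0), parent="robot"))
```

Because each subfield has its own pose, changing the robot's translation moves the attached object while the background Gaussians stay fixed, which is the kind of controllability a decoupled representation is meant to provide.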

Problem

Research questions and friction points this paper is trying to address.

3D scene reconstruction

manipulation trajectory

monocular video

interactive environments

dynamic scenes

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting

Decoupled Representation

Monocular Video