🤖 AI Summary
Problem: Existing methods struggle to jointly model geometry, appearance, and physical properties. This makes it hard to build high-fidelity, simulatable, and renderable world models in novel environments from a single real-world robotic interaction sequence comprising visual and tactile observations.
Method: We propose the first end-to-end jointly optimized framework: a differentiable point-based geometric representation captures scene structure; a voxelized appearance field enables photorealistic rendering; and differentiable collision detection coupled with physics simulation ensures dynamical consistency.
Contribution/Results: Our approach provides the first unified rigid-body representation that co-optimizes geometry, appearance, and physics. Experiments demonstrate that a single real interaction sequence suffices to reconstruct a world model capable of both forward simulation and real-time rendering. The resulting model significantly outperforms state-of-the-art single-modality methods in fidelity and cross-environment generalization.
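To make the idea of differentiable collision detection on a point-based geometry concrete, here is a minimal sketch. It is not the authors' formulation: the softplus penalty, the flat ground plane at z = 0, and the names `soft_penetration`, `collision_penalty`, and `collision_penalty_grad` are all assumptions chosen for illustration. The point is only that a smooth penetration measure over surface points yields a usable gradient with respect to the object's pose (here reduced to a single height).

```python
import math

def soft_penetration(z, beta=50.0):
    """Smooth stand-in for max(-z, 0): depth of a point at height z below the plane z = 0.

    Hypothetical helper; beta controls how sharply the penalty switches on.
    """
    a = -beta * z
    if a > 30.0:
        # Deep below the plane: softplus is effectively linear, and this avoids overflow.
        return -z
    return math.log1p(math.exp(a)) / beta

def collision_penalty(local_z, height, beta=50.0):
    """Sum of soft penetration depths over an object's surface points.

    local_z: z-offsets of the points in the object frame (assumed representation).
    height:  z-coordinate of the object origin in the world frame.
    """
    return sum(soft_penetration(height + pz, beta) for pz in local_z)

def collision_penalty_grad(local_z, height, beta=50.0):
    """Analytic derivative of collision_penalty with respect to height."""
    total = 0.0
    for pz in local_z:
        a = beta * (height + pz)
        a = max(min(a, 700.0), -700.0)  # keep exp() within floating-point range
        total += -1.0 / (1.0 + math.exp(a))
    return total

# Three surface points of a toy object, evaluated well above and at the plane.
points = [-0.1, 0.0, 0.1]
penalty_clear = collision_penalty(points, 1.0)
penalty_touching = collision_penalty(points, 0.0)
grad_touching = collision_penalty_grad(points, 0.0)
```

Because the penalty is smooth, its gradient is nonzero even slightly before contact, which is what lets an optimizer adjust geometry or pose through a contact event instead of stalling at a hard in/out test.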
📝 Abstract
Identifying predictive world models from sparse online observations is essential for robot task planning and execution in novel environments. However, existing methods that leverage differentiable programming to identify world models cannot jointly optimize the geometry, appearance, and physical properties of a scene. In this work, we introduce a rigid object representation that allows these properties to be identified jointly. Our method couples a differentiable point-based geometry representation with a grid-based appearance field, which enables differentiable object collision detection and rendering. Combined with a differentiable physics simulator, this yields end-to-end optimization of world models from sparse visual and tactile observations of a physical motion sequence. Through a series of world model identification tasks in simulated and real environments, we show that our method can learn world models that are ready for both simulation and rendering from only one robot action sequence.
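The end-to-end identification described above can be illustrated with a deliberately small toy problem. This sketch is an assumption-laden stand-in, not the paper's simulator: it recovers a single friction coefficient for a block sliding in one dimension, uses hand-written forward sensitivities in place of an automatic differentiation framework, and assumes the block keeps sliding over the whole horizon. The names `rollout` and `identify_friction`, and all numeric settings, are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2

def rollout(mu, v0=2.0, dt=0.01, steps=50):
    """Explicit-Euler rollout of a block decelerated by Coulomb friction.

    Returns the positions after each step and their derivatives with respect
    to mu, propagated alongside the state (forward sensitivity analysis).
    """
    x, v = 0.0, v0
    dx, dv = 0.0, 0.0  # d(x)/d(mu) and d(v)/d(mu)
    xs, dxs = [], []
    for _ in range(steps):
        x, dx = x + v * dt, dx + dv * dt
        v, dv = v - mu * G * dt, dv - G * dt
        xs.append(x)
        dxs.append(dx)
    return xs, dxs

def identify_friction(observed, mu_init=0.1, lr=0.03, iters=200):
    """Gradient descent on the squared position error of the whole trajectory."""
    mu = mu_init
    for _ in range(iters):
        xs, dxs = rollout(mu)
        grad = sum(2.0 * (x - o) * dx for x, o, dx in zip(xs, observed, dxs))
        mu -= lr * grad
    return mu

# One "observed" motion sequence, generated here with a known ground-truth value.
mu_true = 0.3
observed, _ = rollout(mu_true)
mu_est = identify_friction(observed)
```

In the setting the abstract describes, the same loop would carry gradients of a rendering loss and a tactile loss back to geometry and appearance parameters as well; the sketch keeps only the physical parameter so that the structure of simulate, compare, and update stays visible.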