🤖 AI Summary
This work addresses the challenge of learning physically plausible 3D dynamics directly from multi-view RGB videos, without explicit articulation constraints or ground-truth 3D supervision. The proposed method encodes the input videos into a dynamic 3D Gaussian particle representation, models point-wise latent-space dynamics with a spatiotemporally encoded transformer, jointly optimizes motion and illumination under an inverse rendering objective, and renders high-fidelity frames via 3D Gaussian splatting. Crucially, the framework implicitly learns per-particle physical attributes, such as mass, elasticity, and friction, enabling unified simulation of rigid bodies, elastic deformables, and cloth-like materials. This implicit physical parameterization significantly improves generalization to unseen multi-body interactions and novel scene edits, while preserving realistic lighting effects. Experiments demonstrate improved simulation quality, editability, and physical fidelity compared to prior learning-based approaches.
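To make the point-wise latent dynamics concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' code: each Gaussian particle carries a latent vector, particle positions plus a frame index provide a spatiotemporal encoding, and a transformer predicts per-particle position updates so that interactions emerge from attention rather than from explicit connectivity. All class, argument, and tensor names below are assumptions for illustration.

```python
# Minimal sketch (assumptions, not the authors' implementation): per-particle
# latent dynamics over 3D Gaussian particles via a Transformer.
import torch
import torch.nn as nn


class ParticleDynamicsTransformer(nn.Module):
    """Propagates point-wise latent vectors one time step forward."""

    def __init__(self, latent_dim: int = 128, num_heads: int = 8, num_layers: int = 4):
        super().__init__()
        # Embed particle position (3) + frame index (1) as a spatiotemporal encoding.
        self.spatiotemporal_enc = nn.Linear(3 + 1, latent_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Decode each particle's latent into a position update; no explicit
        # connectivity is imposed, interactions come from attention alone.
        self.delta_head = nn.Linear(latent_dim, 3)

    def forward(self, latents, positions, t):
        # latents:   (B, N, D) point-wise latents carrying implicit physical attributes
        # positions: (B, N, 3) current Gaussian particle centers
        # t:         scalar frame index, broadcast to every particle
        time = torch.full_like(positions[..., :1], float(t))
        tokens = latents + self.spatiotemporal_enc(torch.cat([positions, time], dim=-1))
        tokens = self.transformer(tokens)
        return positions + self.delta_head(tokens), tokens


# Usage: roll the latent dynamics out for a few frames.
model = ParticleDynamicsTransformer()
latents = torch.randn(1, 512, 128)   # one scene, 512 particles
positions = torch.randn(1, 512, 3)
for t in range(3):
    positions, latents = model(latents, positions, t)
```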
📝 Abstract
Learning physics simulations from video data requires maintaining spatial and temporal consistency, a challenge often addressed with strong inductive biases or ground-truth 3D information, which limits scalability and generalization. We introduce 3DGSim, a 3D physics simulator that learns object dynamics end-to-end from multi-view RGB videos. It encodes images into a 3D Gaussian particle representation, propagates dynamics via a transformer, and renders frames using 3D Gaussian splatting. By jointly training inverse rendering with a dynamics transformer through a temporal encoding and merging layer, 3DGSim embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints. This enables the model to capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects, and to generalize to unseen multi-body interactions and novel scene edits.
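As a rough illustration of the temporal encoding and merging layer mentioned in the abstract, the sketch below tags per-frame, per-particle features with a learned time embedding and merges the input window into a single point-wise latent per particle. This is one plausible reading under stated assumptions, not the paper's actual layer; `TemporalEncodeMerge`, the averaging-plus-MLP merge, and all shapes are hypothetical.

```python
# Minimal sketch (an assumed form of a temporal encoding and merging layer,
# not the authors' implementation).
import torch
import torch.nn as nn


class TemporalEncodeMerge(nn.Module):
    def __init__(self, feat_dim: int = 128, window: int = 3):
        super().__init__()
        self.time_embed = nn.Embedding(window, feat_dim)   # learned per-frame code
        self.merge = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, N, D) per-frame, per-particle features
        B, T, N, D = frame_feats.shape
        encoded = frame_feats + self.time_embed(torch.arange(T)).view(1, T, 1, D)
        # Merge the temporal window by averaging, then mix channels; the merged
        # latent is what the dynamics transformer would consume downstream.
        return self.merge(encoded.mean(dim=1))             # (B, N, D)


merger = TemporalEncodeMerge()
feats = torch.randn(2, 3, 256, 128)   # 2 scenes, 3 input frames, 256 particles
latents = merger(feats)               # (2, 256, 128)
```

The design intuition this sketch tries to capture is that the merged per-particle latent summarizes short-horizon motion, so physical properties can be inferred jointly with appearance during inverse rendering rather than supervised explicitly.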