🤖 AI Summary
Existing video prediction methods (e.g., ViPro) rely on ground-truth initial symbolic states, inducing learning shortcuts and undermining robust latent state estimation from raw observations. This work proposes the first unsupervised video prediction framework that requires no initial ground-truth symbolic states. By unifying dynamical modeling with variational inference, it jointly optimizes a differentiable symbolic system and deep neural networks to enable end-to-end inference of dynamic latent states and future frame prediction directly from raw observations. A key innovation is the explicit decoupling of the observation space from the symbolic representation, eliminating dependence on annotated symbolic states. To rigorously evaluate generalization and robustness under noisy observations, we introduce the challenging 3D Orbits dataset. Experiments demonstrate that our method, while fully unsupervised, simultaneously achieves high-fidelity frame prediction and accurate latent state estimation, outperforming prior approaches in both fidelity and robustness.
📝 Abstract
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models to handle complex dynamical settings; however, their model ViPro assumed a given ground-truth initial symbolic state. We show that this assumption led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, leaving it unable to estimate states from an observation when previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without being provided a full ground-truth state at the start. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.
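The pipeline described above (amortized inference of an initial symbolic state from raw frames, a differentiable symbolic rollout, and a decoder back to observation space, trained with a variational objective) can be sketched in miniature. Everything below is a hypothetical illustration, not the paper's actual architecture: the toy orbit dynamics, the linear encoder/decoder, and all names and shapes are assumptions chosen only to make the structure of the loss concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def symbolic_step(state, dt=0.1, g=1.0):
    """Toy differentiable dynamics: a point mass around a central attractor.
    state = [x, y, vx, vy]; a stand-in for the procedural/symbolic module."""
    x, y, vx, vy = state
    r3 = (x**2 + y**2) ** 1.5 + 1e-8
    ax, ay = -g * x / r3, -g * y / r3
    return np.array([x + dt * vx, y + dt * vy, vx + dt * ax, vy + dt * ay])

def encode(frame, W_enc):
    """Amortized inference: map a raw frame to (mean, log-variance)
    of the 4-d latent symbolic state."""
    h = frame @ W_enc
    return h[:4], h[4:]

def decode(state, W_dec):
    """Render a frame from the symbolic state. The observation space is
    decoupled from the symbolic representation; only this map ties them."""
    return state @ W_dec

def elbo_loss(frames, W_enc, W_dec):
    """Variational objective over a rollout: infer the initial state from the
    first frame, roll the symbolic system forward, decode, and compare."""
    mu, logvar = encode(frames[0], W_enc)
    state = mu + np.exp(0.5 * logvar) * rng.standard_normal(4)  # reparameterized sample
    recon = 0.0
    for frame in frames:
        recon += np.mean((decode(state, W_dec) - frame) ** 2)
        state = symbolic_step(state)
    # KL of the inferred Gaussian against a standard-normal prior
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon / len(frames) + kl

# Toy data: 8-d "frames" rendered from a ground-truth orbit (illustration only).
W_dec_true = rng.standard_normal((4, 8)) * 0.1
true_state = np.array([1.0, 0.0, 0.0, 1.0])
frames = []
for _ in range(5):
    frames.append(true_state @ W_dec_true)
    true_state = symbolic_step(true_state)
frames = np.array(frames)

W_enc = rng.standard_normal((8, 8)) * 0.1
W_dec = rng.standard_normal((4, 8)) * 0.1
loss = elbo_loss(frames, W_enc, W_dec)
print(f"ELBO-style loss on random weights: {loss:.3f}")
```

In a real model the encoder and decoder would be deep networks and the gradients of the symbolic step would flow through the rollout, so that training the reconstruction term end-to-end is what forces the inferred state to be a genuine estimate from the observation rather than a shortcut.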