🤖 AI Summary
Existing video prediction methods (e.g., ViPro) rely on ground-truth initial symbolic states, inducing learning shortcuts and undermining robust latent state estimation from raw observations. This work proposes the first unsupervised video prediction framework that requires no initial ground-truth symbolic states. By unifying dynamical modeling with variational inference, it jointly optimizes a differentiable symbolic system and deep neural networks to enable end-to-end inference of dynamic latent states and future frame prediction directly from raw observations. A key innovation is the explicit decoupling of the observation space from the symbolic representation, eliminating dependence on annotated symbolic states. To rigorously evaluate generalization and robustness under noisy observations, we introduce the challenging 3D Orbits dataset. Experiments demonstrate that our method, while fully unsupervised, simultaneously achieves high-fidelity frame prediction and accurate latent state estimation, outperforming prior approaches in both fidelity and robustness.
📝 Abstract
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models to handle complex dynamical settings; however, their model ViPro assumed a given ground-truth initial symbolic state. We show that this assumption led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, leaving it unable to estimate states from an observation when previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without being provided a full ground-truth state at the start. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.
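The pipeline described above (amortized inference of an initial symbolic state from raw frames, a differentiable symbolic rollout, and a decoder back to observation space, trained with a variational objective) can be sketched in miniature. Everything below is a hypothetical illustration, not the paper's actual architecture: the toy orbit dynamics, the linear encoder/decoder, and all names and shapes are assumptions chosen only to make the structure of the loss concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def symbolic_step(state, dt=0.1, g=1.0):
    """Toy differentiable dynamics: a point mass around a central attractor.
    state = [x, y, vx, vy]; a stand-in for the procedural/symbolic module."""
    x, y, vx, vy = state
    r3 = (x**2 + y**2) ** 1.5 + 1e-8
    ax, ay = -g * x / r3, -g * y / r3
    return np.array([x + dt * vx, y + dt * vy, vx + dt * ax, vy + dt * ay])

def encode(frame, W_enc):
    """Amortized inference: map a raw frame to (mean, log-variance)
    of the 4-d latent symbolic state."""
    h = frame @ W_enc
    return h[:4], h[4:]

def decode(state, W_dec):
    """Render a frame from the symbolic state. The observation space is
    decoupled from the symbolic representation; only this map ties them."""
    return state @ W_dec

def elbo_loss(frames, W_enc, W_dec):
    """Variational objective over a rollout: infer the initial state from the
    first frame, roll the symbolic system forward, decode, and compare."""
    mu, logvar = encode(frames[0], W_enc)
    state = mu + np.exp(0.5 * logvar) * rng.standard_normal(4)  # reparameterized sample
    recon = 0.0
    for frame in frames:
        recon += np.mean((decode(state, W_dec) - frame) ** 2)
        state = symbolic_step(state)
    # KL of the inferred Gaussian against a standard-normal prior
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon / len(frames) + kl

# Toy data: 8-d "frames" rendered from a ground-truth orbit (illustration only).
W_dec_true = rng.standard_normal((4, 8)) * 0.1
true_state = np.array([1.0, 0.0, 0.0, 1.0])
frames = []
for _ in range(5):
    frames.append(true_state @ W_dec_true)
    true_state = symbolic_step(true_state)
frames = np.array(frames)

W_enc = rng.standard_normal((8, 8)) * 0.1
W_dec = rng.standard_normal((4, 8)) * 0.1
loss = elbo_loss(frames, W_enc, W_dec)
print(f"ELBO-style loss on random weights: {loss:.3f}")
```

In a real model the encoder and decoder would be deep networks and the gradients of the symbolic step would flow through the rollout, so that training the reconstruction term end-to-end is what forces the inferred state to be a genuine estimate from the observation rather than a shortcut.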