PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses robot world modeling from unlabeled video data. Method: the authors propose an object-centric, controllable video prediction and planning framework that jointly learns interpretable latent action representations and disentangled object dynamics from action-free videos, integrating slot-based object representations, variational latent modeling, inverse dynamics learning, and conditional latent action sampling. Contribution/Results: the approach enables user-controllable multi-future trajectory prediction, supports backward action inference and policy generation, and closes the loop from video understanding to robot policy learning. Crucially, it achieves sample-efficient policy learning from raw video alone, without action labels or environment instrumentation. Experiments across multiple simulated environments show consistent improvements over stochastic and state-of-the-art object-centric baselines in both prediction accuracy and downstream control performance.
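The core mechanism described above, an inverse dynamics model that infers a latent action from consecutive slot states, paired with a forward predictor conditioned on that action, can be sketched roughly as follows. This is a minimal illustration with untrained random layers standing in for learned MLPs; the dimensions and function names are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 4 slots, 8-dim slot state, 2-dim latent action.
NUM_SLOTS, SLOT_DIM, ACTION_DIM = 4, 8, 2

def linear(in_dim, out_dim):
    """Random weight matrix as a stand-in for a trained network."""
    return rng.normal(0.0, 0.1, (in_dim, out_dim))

# Inverse dynamics: map two consecutive slot states to a latent action.
W_inv = linear(2 * NUM_SLOTS * SLOT_DIM, ACTION_DIM)
# Forward predictor: roll the slot states forward, conditioned on the latent action.
W_fwd = linear(NUM_SLOTS * SLOT_DIM + ACTION_DIM, NUM_SLOTS * SLOT_DIM)

def infer_action(slots_t, slots_t1):
    """Backward inference: latent action u_t explaining the observed transition."""
    x = np.concatenate([slots_t.ravel(), slots_t1.ravel()])
    return np.tanh(x @ W_inv)  # bounded latent action

def predict_next(slots_t, action):
    """Forward prediction of the next slot states given a latent action."""
    x = np.concatenate([slots_t.ravel(), action])
    return (x @ W_fwd).reshape(NUM_SLOTS, SLOT_DIM)

slots_t = rng.normal(size=(NUM_SLOTS, SLOT_DIM))
slots_t1 = rng.normal(size=(NUM_SLOTS, SLOT_DIM))

u_t = infer_action(slots_t, slots_t1)                     # inferred from video dynamics
pred = predict_next(slots_t, u_t)                         # rollout under inferred action
user_pred = predict_next(slots_t, np.array([1.0, 0.0]))   # user-provided latent action
```

Sampling different latent actions for the same `slots_t` would yield the multiple possible futures the summary mentions; training would fit `W_inv` and `W_fwd` jointly so that predicted and observed next slots match.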

📝 Abstract
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available at https://play-slot.github.io/PlaySlot/.
Problem

Research questions and friction points this paper is trying to address.

Infer latent actions from unlabeled video sequences.
Predict future object states and video frames.
Enable versatile and interpretable world modeling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric video prediction with slot-based representations
Inverse dynamics model that infers latent actions from video
Learns object dynamics and behaviors from unlabeled videos