Object-Centric Image to Video Generation with Language Guidance

πŸ“… 2025-02-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limited scalability and semantic controllability of object-centric models in complex scenes, hindering autonomous systems’ environmental understanding and future event prediction. To this end, we propose the first text-guided object-centric image-to-video generation framework: an object slot encoder first parses the input image into structured, interpretable object representations; a text-conditioned Transformer-based temporal predictor then jointly models object dynamics and inter-object interactions to generate fine-grained, language-driven future frames. Our key contribution is the first deep integration of structured object-centric representations with fine-grained linguistic guidance. Experiments demonstrate that our method outperforms state-of-the-art image-to-video models across multiple benchmarks, achieving significant improvements in physical plausibility, editable controllability, and cross-scene generalization.

Technology Category

Application Category

πŸ“ Abstract
Accurate and flexible world models are crucial for autonomous systems to understand their environment and predict future events. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions, but often face challenges in scaling to complex datasets and incorporating external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for image-to-video generation guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions. Our method's structured latent space offers enhanced control over the prediction process, outperforming several image-to-video generative baselines. Additionally, we demonstrate that structured object-centric representations provide superior controllability and interpretability, facilitating the modeling of object dynamics and enabling more precise and understandable predictions. Videos and code are available at https://play-slot.github.io/TextOCVP/.
Problem

Research questions and friction points this paper is trying to address.

Enhancing object-centric video generation
Integrating textual guidance in predictions
Improving controllability and interpretability of models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric image-to-video model
Text-conditioned transformer predictor
Structured latent space control
πŸ”Ž Similar Papers
No similar papers found.