🤖 AI Summary
This work investigates whether vision-language foundation models implicitly encode world models (mapping observation × action → observation) and dynamics models (mapping observation × observation → action), where actions are natural language descriptions. To address the challenge of jointly learning these complementary models, we propose a novel "dynamics-model-guided world-model" paradigm: a dynamics model generates synthetic data for weakly supervised world-model training, and an importance-weighted frame-pair learning objective is introduced. During inference, a multi-sample scoring mechanism enables guided search. Our method integrates multimodal fine-tuning, GPT-4o-as-judge automated evaluation, and the Aurora-Bench human evaluation framework. On the Aurora-Bench action-centric image editing benchmark, our approach surpasses prior state-of-the-art by 15% on real-world subsets under GPT-4o evaluation and achieves the highest average human score across all subsets.
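The multi-sample scoring mechanism described above amounts to best-of-N search: the world model proposes several candidate next observations, and the dynamics model scores each one against the action description. The following is a minimal sketch of that loop; the two model functions are stand-ins (not the paper's actual API), with toy string outputs and a toy reward in place of real image generation and action-likelihood scoring.

```python
import random


def world_model_sample(observation, action, rng):
    # Stand-in: a real world model would generate a candidate edited image
    # conditioned on the current observation and the language action.
    return f"{observation}|{action}|sample={rng.random():.3f}"


def dynamics_model_score(observation, candidate, action):
    # Stand-in reward: a real dynamics model would return the likelihood of
    # `action` given the (before, after) observation pair.
    return len(set(action) & set(candidate)) / max(len(action), 1)


def best_of_n(observation, action, n=8, seed=0):
    # Guided search at inference time: sample N candidates from the world
    # model, then keep the one the dynamics model rewards most highly.
    rng = random.Random(seed)
    candidates = [world_model_sample(observation, action, rng) for _ in range(n)]
    return max(candidates, key=lambda c: dynamics_model_score(observation, c, action))


best = best_of_n("kitchen scene", "open the fridge door")
```

Because the dynamics model only needs to verify (observation, observation) → action consistency, it acts as a cheap reranker over world-model samples rather than requiring any change to the generator itself.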
📝 Abstract
To what extent do vision-and-language foundation models possess a realistic world model (observation $\times$ action $\rightarrow$ observation) and a dynamics model (observation $\times$ observation $\rightarrow$ action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics model can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves performance competitive with state-of-the-art image editing models, improving on them by a margin of $15\%$ on real-world subsets according to GPT-4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
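The importance-weighted objective can be sketched as per-token cross-entropy over the predicted observation's image tokens, rescaled by importance weights from a recognition model so that tokens in regions affected by the action dominate the loss. The shapes and the weighting scheme below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np


def weighted_token_loss(logits, targets, importance):
    # logits: (T, V) unnormalized scores over a vocabulary of V image tokens
    # targets: (T,) ground-truth token ids for the next observation
    # importance: (T,) nonnegative weights from a recognition model
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    # Normalize so the weights sum to 1; uniform importance recovers the
    # ordinary mean cross-entropy over tokens.
    weights = importance / (importance.sum() + 1e-8)
    return float((weights * token_nll).sum())


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
targets = np.array([1, 3, 5, 7])
uniform_loss = weighted_token_loss(logits, targets, np.ones(4))
```

Normalizing the weights keeps the loss scale comparable across frame pairs regardless of how many tokens the recognition model marks as important.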