World-in-World: World Models in a Closed-Loop World

📅 2025-10-20

🤖 AI Summary

This work asks whether generative world models (WMs) actually help embodied agents make decisions in closed-loop environments. The paper introduces World-in-World, the first open platform that benchmarks WMs in a closed loop mirroring real agent-environment interaction, with task success rather than visual quality as the primary metric. The platform provides a unified online planning strategy and a standardized action API so that heterogeneous WMs can be used for decision making. It covers four closed-loop environments and is used to present the first data scaling law for world models in embodied settings. Three findings stand out: visual quality alone does not guarantee task success, and controllability matters more; scaling post-training with action-observation data is more effective than upgrading the pretrained video generator; and allocating more inference-time compute substantially improves closed-loop performance. Together, these findings shift the evaluation of WMs from perceptual prediction toward interactive control.

📝 Abstract

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.

Problem

Research questions and friction points this paper is trying to address.

Evaluating world models' embodied utility beyond visual quality alone

Developing closed-loop benchmarks for agent decision-making performance

Investigating scaling laws and controllability in embodied AI systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop platform benchmarks world models

Unified planning strategy enables decision making

Task success prioritized over visual quality
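
To make the closed-loop idea concrete, here is a minimal sketch of online planning with a world model, not the paper's implementation. Everything in it is a hypothetical stand-in: a 1-D integer world, the `toy_world_model` and `toy_env_step` functions, the three-action set, and the distance-based score. It only illustrates the loop the abstract describes: propose candidate plans, imagine their outcomes with the WM, execute the first action of the best plan, observe, and replan.

```python
from itertools import product
from typing import Callable, List, Sequence

Action = int  # hypothetical discrete action: -1 (left), 0 (stay), +1 (right)
ACTIONS: Sequence[Action] = (-1, 0, 1)
WorldModel = Callable[[int, Sequence[Action]], List[int]]


def toy_world_model(obs: int, plan: Sequence[Action]) -> List[int]:
    """Stand-in for a generative WM: predicts the observation after each action."""
    predicted, current = [], obs
    for action in plan:
        current += action
        predicted.append(current)
    return predicted


def toy_env_step(state: int, action: Action) -> int:
    """Stand-in for the real environment (here it matches the toy WM exactly)."""
    return state + action


def plan_once(obs: int, goal: int, world_model: WorldModel, horizon: int) -> Sequence[Action]:
    """Imagine every candidate plan with the WM and return the lowest-cost one.

    Cost is the summed distance to the goal over the imagined trajectory,
    so plans that reach the goal sooner are preferred.
    """
    candidates = product(ACTIONS, repeat=horizon)
    return min(candidates, key=lambda p: sum(abs(o - goal) for o in world_model(obs, p)))


def run_episode(start: int, goal: int, world_model: WorldModel = toy_world_model,
                horizon: int = 3, max_steps: int = 20) -> bool:
    """Closed loop: plan, execute only the first action, observe, replan.

    Returns task success, the metric the platform prioritizes.
    """
    state = start
    for _ in range(max_steps):
        if state == goal:
            return True
        plan = plan_once(state, goal, world_model, horizon)
        state = toy_env_step(state, plan[0])
    return state == goal


def uncontrollable_model(obs: int, plan: Sequence[Action]) -> List[int]:
    """A WM whose predictions ignore the actions it is given."""
    return [obs] * len(plan)


if __name__ == "__main__":
    print(run_episode(0, 5))
    print(run_episode(0, 5, world_model=uncontrollable_model))
```

In this toy setting the agent succeeds with the action-conditioned model and fails with `uncontrollable_model`, because every imagined plan then looks identical to the planner. That is only an illustration of why controllability can matter for task success, not evidence for the paper's result.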

🔎 Similar Papers

Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond