World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how to integrate concrete visual future simulation with abstract linguistic reasoning to improve accuracy and robustness on physical and spatial prediction tasks. The authors formulate the problem as controlled concrete reasoning, in which a model learns when to invoke a world-model rollout, whether to trust it, and how to combine it with the abstract reasoning of a multimodal large language model. Central to the approach is Privileged-Future On-Policy Self-Distillation (PF-OPSD), which uses ground-truth future videos and answers only as teacher-side privileged context during training, so the deployed student never observes true futures at test time. Two new human-verified benchmarks are introduced for evaluation: VRQABench and OpenWorldQA. PF-OPSD outperforms the baseline by 10.6% and 10.9% on these benchmarks, respectively, while increasing robustness to noisy or conflicting visual rollouts.
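To make the on-policy self-distillation idea summarized above more concrete, here is a minimal sketch of one plausible training signal. It is not the authors' implementation: the function names, the use of a per-token KL divergence, and the KL direction (teacher to student) are all assumptions made for illustration. The only idea taken from the source is that the same model acts as teacher when given privileged context (the true future) and as student when it is not, and that the trajectory being scored is sampled by the student.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) along one student-sampled trajectory.

    student_logits[t]: logits at position t given image + question only.
    teacher_logits[t]: logits at the same position from the same model, but
        additionally conditioned on privileged context (hypothetically, the
        ground-truth future video and answer).
    The KL form and direction are assumptions, not taken from the paper.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must score the same positions")
    if not student_logits:
        raise ValueError("trajectory must contain at least one token")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy two-token trajectory over a three-word vocabulary.
    student = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.0]]
    teacher = [[2.0, 0.0, 0.0], [1.0, 0.5, 0.0]]
    print(self_distillation_loss(student, teacher))
```

In a real training loop the gradient would flow only through the student logits, with the teacher pass treated as a fixed target; that detail is also an assumption here and is not shown in this dependency-free sketch.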

📝 Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.
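The three decisions named in the abstract (when to invoke simulation, whether a rollout is credible, and how it should influence the answer) can be pictured as a simple control flow. The sketch below is illustrative only: in the paper these decisions are learned by the model, whereas here they are passed in as hypothetical callables (`should_simulate`, `simulate`, `verify`, `reason`), and the fixed `min_credibility` threshold is an assumption.

```python
from typing import Any, Callable, Optional


def answer_with_controlled_rollout(
    observation: Any,
    question: str,
    should_simulate: Callable[[Any, str], bool],
    simulate: Callable[[Any, str], list],
    verify: Callable[[Any, str, list], float],
    reason: Callable[[Any, str, Optional[list]], str],
    min_credibility: float = 0.5,
) -> str:
    """Answer a question, using a visual rollout only when it is judged useful
    and credible. All callables are hypothetical stand-ins."""
    # Invoke: decide whether visual simulation is worth running at all.
    if not should_simulate(observation, question):
        return reason(observation, question, None)

    # Generate one (stochastic) rollout of possible future frames.
    frames = simulate(observation, question)

    # Verify: score how credible the rollout is for this question.
    score = verify(observation, question, frames)

    # Integrate: fall back to abstract reasoning if the rollout is not trusted.
    if score < min_credibility:
        return reason(observation, question, None)
    return reason(observation, question, frames)


if __name__ == "__main__":
    result = answer_with_controlled_rollout(
        observation="frame_0",
        question="Will the ball fall off the table?",
        should_simulate=lambda obs, q: True,
        simulate=lambda obs, q: ["frame_1", "frame_2"],
        verify=lambda obs, q, frames: 0.9,
        reason=lambda obs, q, frames: (
            "answer from rollout" if frames is not None else "answer from text"
        ),
    )
    print(result)
```

Keeping the fallback path explicit reflects the abstract's point that a rollout can be visually plausible yet task-incorrect, so the abstract-reasoning answer must remain available when verification fails.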

Problem

Research questions and friction points this paper is trying to address.

world models

multimodal large language models

concrete reasoning

abstract reasoning

visual simulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

controlled concrete reasoning

world models

multimodal large language models