PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

📅 2025-11-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current world models suffer from limited generalizability, lack of interactive capability, and insufficient long-horizon consistency, while video generation models lack causal control and action conditioning. This paper introduces PAN, a general, interactable, and long-horizon world model capable of high-fidelity, action-conditioned, video-level simulation of future world states in open domains. PAN adopts the Generative Latent Prediction (GLP) architecture, which unifies language instructions and visual dynamics within a shared latent space by coupling an autoregressive latent dynamics backbone driven by a large language model with a video diffusion decoder, trained on large-scale video–action pairs. Experiments demonstrate that PAN significantly outperforms existing methods on action-conditioned simulation, hundred-frame long-horizon prediction, and simulation-based reasoning tasks, exhibiting strong generalization and practical utility.

📝 Abstract
A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
Problem

Research questions and friction points this paper is trying to address.

Existing video models lack causal control and long-horizon consistency for reasoning
Current world models are restricted to specific domains with limited generalization
There is no unified system combining latent reasoning with realistic visual dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive latent dynamics with LLM backbone
Video diffusion decoder for visual reconstruction
Generative Latent Prediction (GLP) architecture unifying latent-space reasoning with realizable visual dynamics
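The GLP loop described above can be sketched in a few lines: an autoregressive backbone predicts the next world state in latent space conditioned on history and a language-specified action, and a decoder renders each latent back into an observable frame. The dimensions, function names, and linear maps below are illustrative stand-ins for exposition only, not PAN's actual LLM backbone or diffusion decoder.

```python
import numpy as np

LATENT_DIM, ACTION_DIM, FRAME_PIXELS = 16, 8, 64
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))      # history -> latent
W_a = rng.normal(scale=0.1, size=(ACTION_DIM, LATENT_DIM))      # action  -> latent
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, FRAME_PIXELS))  # latent  -> frame

def backbone_step(h_prev, action_emb):
    """Stand-in for the LLM-based autoregressive latent dynamics backbone."""
    return np.tanh(h_prev @ W_h + action_emb @ W_a)

def decode_frame(h):
    """Stand-in for the video diffusion decoder (here a single linear map)."""
    return h @ W_dec

def simulate(h0, action_embs):
    """Roll the world model forward: imagine each latent, then realize it."""
    h, frames = h0, []
    for a in action_embs:
        h = backbone_step(h, a)         # predict next latent world state
        frames.append(decode_frame(h))  # decode it into an observation
    return np.stack(frames)

h0 = np.zeros(LATENT_DIM)
actions = rng.normal(size=(100, ACTION_DIM))  # e.g. a 100-step long-horizon rollout
frames = simulate(h0, actions)
print(frames.shape)  # (100, 64): one decoded frame per action step
```

The key structural point the paper's architecture makes is visible even in this toy: prediction happens entirely in latent space, and rendering to pixels is a separate decoding stage, so long-horizon rollouts do not require feeding generated frames back into the model.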
Authors

PAN Team (Institute of Foundation Models)
Jiannan Xiang (Mohamed bin Zayed University of Artificial Intelligence)
Yi Gu (Mohamed bin Zayed University of Artificial Intelligence)
Zihan Liu (Mohamed bin Zayed University of Artificial Intelligence)
Zeyu Feng (Mohamed bin Zayed University of Artificial Intelligence)
Qiyue Gao (PhD student, UC San Diego)
Yiyan Hu (Mohamed bin Zayed University of Artificial Intelligence)
Benhao Huang (Mohamed bin Zayed University of Artificial Intelligence)
Guangyi Liu (Mohamed bin Zayed University of Artificial Intelligence)
Yichi Yang (Mohamed bin Zayed University of Artificial Intelligence)
Kun Zhou (Mohamed bin Zayed University of Artificial Intelligence)
Davit Abrahamyan (MS CSE student, UC San Diego)
Arif Ahmad (Mohamed bin Zayed University of Artificial Intelligence)
Ganesh Bannur (Mohamed bin Zayed University of Artificial Intelligence)
Junrong Chen (Mohamed bin Zayed University of Artificial Intelligence)
Kimi Chen (Mohamed bin Zayed University of Artificial Intelligence)
Mingkai Deng (Mohamed bin Zayed University of Artificial Intelligence)
Ruobing Han (Google)
Xinqi Huang (Mohamed bin Zayed University of Artificial Intelligence)
Haoqiang Kang (PhD student, UC San Diego)
Zheqi Li (Mohamed bin Zayed University of Artificial Intelligence)
Enze Ma (Mohamed bin Zayed University of Artificial Intelligence)
Hector Ren (Mohamed bin Zayed University of Artificial Intelligence)
Yashowardhan Shinde (Mohamed bin Zayed University of Artificial Intelligence)
Rohan Shingre (Mohamed bin Zayed University of Artificial Intelligence)
Ramsundar Tanikella (Mohamed bin Zayed University of Artificial Intelligence)
Kaiming Tao (Mohamed bin Zayed University of Artificial Intelligence)
Dequan Yang (Mohamed bin Zayed University of Artificial Intelligence)
Xinle Yu (Mohamed bin Zayed University of Artificial Intelligence)
Cong Zeng (Mohamed bin Zayed University of Artificial Intelligence)
Bing Zhou (Mohamed bin Zayed University of Artificial Intelligence)
Zhengzhong Liu (Institute of Foundation Models)
Zhiting Hu (Assistant Professor, UC San Diego)
Eric P. Xing (Mohamed bin Zayed University of Artificial Intelligence)