World Modeling with Probabilistic Structure Integration

📅 2025-09-10
🤖 AI Summary
Problem: Existing world models lack both strong controllability and flexible prompting for structured scene understanding. Method: We propose Probabilistic Structure Integration (PSI), a three-step iterative cycle of probabilistic prediction, structure extraction, and integration. First, a probabilistic graphical model of the data is built as a random-access autoregressive sequence model. Second, intermediate structures (e.g., optical flow, depth, object segmentation) are extracted from that model in a zero-shot fashion via causal inference. Third, these structures are converted into new token types and mixed back into training as conditioning signals and prediction targets, which yields control handles akin to an LLM-like prompting language. Contribution/Results: Trained on 1.4 trillion tokens of internet video data, our model extracts state-of-the-art optical flow, self-supervised depth, and object segmentation, and the extracted structures support prompt-based control and a full cycle of predictive improvements.
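The phrase "random-access autoregressive" can be read as saying that the model may factor the joint distribution in any variable order and answer any conditional query. The toy sketch below illustrates that idea on an explicit probability table; the three binary variables, the weights, and the helper names `conditional` and `chain_rule` are assumptions made for illustration and are not taken from the paper.

```python
from itertools import permutations

# Toy joint distribution over three binary variables (a, b, c).
# The weights are invented for illustration; they are not from the paper.
weights = {
    (0, 0, 0): 4, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 3,
    (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 3, (1, 1, 1): 4,
}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}


def conditional(joint, query, evidence):
    """Return p(query | evidence).

    `query` and `evidence` map variable index -> value, so any subset
    of variables can be predicted from any other subset.
    """
    def mass(assignment):
        # Total probability of all outcomes consistent with `assignment`.
        return sum(p for x, p in joint.items()
                   if all(x[i] == v for i, v in assignment.items()))
    return mass({**evidence, **query}) / mass(evidence)


def chain_rule(joint, x, order):
    """Rebuild p(x) by multiplying one-variable conditionals in `order`."""
    prob, seen = 1.0, {}
    for i in order:
        prob *= conditional(joint, {i: x[i]}, seen)
        seen[i] = x[i]
    return prob


# Every decoding order recovers the same joint probability for x.
x = (1, 0, 1)
results = [chain_rule(joint, x, order) for order in permutations(range(3))]
```

Here `conditional(joint, {2: 1}, {0: 1})` evaluates p(c=1 | a=1), which is 0.6 up to floating-point rounding. A learned sequence model would approximate such conditionals rather than read them from a table.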

📝 Abstract
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
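One way to picture zero-shot structure extraction via causal inference is a counterfactual probe: perturb a single input location, rerun the predictor, and see where the prediction changes. The sketch below is a minimal illustration under strong assumptions. `toy_predictor` is a hypothetical stand-in that simply shifts the frame by a hidden motion, and `counterfactual_flow` is an illustrative helper; neither is the paper's model or procedure.

```python
def toy_predictor(frame, motion=(1, 2)):
    """Hypothetical next-frame predictor: shifts the frame by `motion`,
    wrapping at the edges. A learned model would take its place."""
    h, w = len(frame), len(frame[0])
    dy, dx = motion
    return [[frame[(r - dy) % h][(c - dx) % w] for c in range(w)]
            for r in range(h)]


def counterfactual_flow(predict, frame, point, bump=1.0):
    """Estimate the displacement of `point` by perturbing it and locating
    the largest change between factual and counterfactual predictions."""
    r0, c0 = point
    perturbed = [row[:] for row in frame]  # copy so `frame` is untouched
    perturbed[r0][c0] += bump
    base = predict(frame)       # factual prediction
    alt = predict(perturbed)    # counterfactual prediction
    h, w = len(frame), len(frame[0])
    r1, c1 = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: abs(alt[rc[0]][rc[1]] - base[rc[0]][rc[1]]),
    )
    return (r1 - r0, c1 - c0)


# An 8x8 frame with arbitrary intensity values.
frame = [[float((r * 8 + c) % 5) for c in range(8)] for r in range(8)]
flow = counterfactual_flow(toy_predictor, frame, point=(2, 3))
```

Because the toy predictor moves every pixel by the same offset, `flow` recovers the hidden motion `(1, 2)` without any flow labels. The probe only queries the predictor, which is what makes the extraction zero-shot in this sketch.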
Problem

Research questions and friction points this paper is trying to address.

Learning controllable world models from video data
Extracting low-dimensional structures via causal inference
Improving video prediction and understanding capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic graphical model for sequence prediction
Zero-shot structure extraction via causal inference
Continual integration of extracted structures as new token types
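Integrating structures as tokens can be sketched as giving each token type its own ID range in one shared vocabulary, then placing segments of different types in a single sequence. The token types, vocabulary sizes, and function names below are hypothetical and chosen only for illustration; the paper's actual tokenization is not described in this summary.

```python
# Hypothetical token types and per-type vocabulary sizes (illustrative only).
VOCAB_SIZES = {"rgb": 1024, "flow": 256}


def add_token_type(vocab_sizes, name, size):
    """Register a new token type, as when a newly extracted structure
    is added to the vocabulary. Returns a new dict."""
    if name in vocab_sizes:
        raise ValueError(f"token type {name!r} already exists")
    return {**vocab_sizes, name: size}


def offsets(vocab_sizes):
    """Give each token type a disjoint ID range in one shared vocabulary."""
    out, start = {}, 0
    for name, size in vocab_sizes.items():
        out[name] = start
        start += size
    return out


def build_sequence(segments, vocab_sizes):
    """Flatten (type, local_ids) segments into global token IDs."""
    off = offsets(vocab_sizes)
    seq = []
    for name, local_ids in segments:
        for t in local_ids:
            if not 0 <= t < vocab_sizes[name]:
                raise ValueError(f"id {t} out of range for type {name!r}")
            seq.append(off[name] + t)
    return seq


# Add a "depth" type, then mix all three types in one sequence.
vocab = add_token_type(VOCAB_SIZES, "depth", 128)
sequence = build_sequence(
    [("rgb", [5, 7]), ("flow", [3]), ("depth", [0, 127])], vocab
)
```

With these sizes, `sequence` is `[5, 7, 1027, 1280, 1407]`. In an autoregressive model, segment order decides the role of each type: segments placed earlier act as conditioning signals, and segments placed later are prediction targets.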
Authors
Klemen Kotar, PhD Candidate, Stanford University (Artificial Intelligence)
Wanhee Lee, Stanford NeuroAI Lab
Rahul Venkatesh, Stanford NeuroAI Lab
Honglin Chen, Stanford NeuroAI Lab
Daniel Bear, Stanford University (Sensory Systems, Perception, Evolution, Artificial Intelligence)
Jared Watrous, Stanford NeuroAI Lab
Simon Kim, Stanford NeuroAI Lab
Khai Loong Aw, Stanford NeuroAI Lab
Lilian Naing Chen, Stanford NeuroAI Lab
Stefan Stojanov, Postdoc at Stanford Vision Lab and Neuro AI Lab (Computer Vision, Machine Learning)
Kevin Feigelis, Stanford University
Imran Thobani, Stanford NeuroAI Lab
Alex Durango, Stanford NeuroAI Lab
Khaled Jedoui, Stanford NeuroAI Lab
Atlas Kazemian, Stanford NeuroAI Lab
Dan Yamins, Stanford NeuroAI Lab