Better World Models Can Lead to Better Post-Training Performance

📅 2025-12-03
🤖 AI Summary
This work investigates how explicit world modeling objectives influence internal representations of Transformers and downstream reinforcement learning performance. We systematically compare state prediction pretraining versus joint state prediction and language modeling pretraining in a 2×2×2 Rubik’s Cube environment, followed by post-training with Group Relative Policy Optimization (GRPO). Using linear probing and causal intervention analysis, we find that explicit world modeling markedly improves the linear separability and causal controllability of state representations. Moreover, world model fidelity strongly correlates with post-trained policy performance—especially on high-complexity states—yielding substantial gains. Crucially, this study establishes, for the first time, an intrinsic link between world model representation quality and policy optimization efficiency. Our findings provide both theoretical grounding and empirical evidence for developing embodied AI models with interpretable and causally manipulable state representations.

📝 Abstract
In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2×2×2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world-model quality affect the model's performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies -- (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective -- and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains from GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
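The linear-probing step described in the abstract amounts to training a linear classifier on frozen hidden states and checking whether a state property can be decoded from them. Below is a minimal sketch of that idea on synthetic data; the dimensions, labels, and the injected signal are illustrative assumptions, not the paper's actual setup or model.

```python
# Minimal linear-probe sketch (synthetic data, not the paper's setup).
# A "linear probe" is a linear classifier fit on frozen activations;
# high held-out accuracy means the feature is linearly decodable.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 300, 64, 6  # samples, hidden size, label classes (assumed)

# Synthetic "hidden states": noise plus a weak linearly separable signal
# in one dimension per class, standing in for a decodable state feature.
labels = rng.integers(0, n_classes, size=n)
hidden = rng.normal(size=(n, d))
hidden[np.arange(n), labels] += 2.0

# Fit the probe in closed form: ridge regression onto one-hot targets,
# then predict by argmax over the class scores.
Y = np.eye(n_classes)[labels]
X_train, Y_train = hidden[:200], Y[:200]
W = np.linalg.solve(X_train.T @ X_train + 1e-2 * np.eye(d), X_train.T @ Y_train)

preds = (hidden[200:] @ W).argmax(axis=1)
accuracy = (preds == labels[200:]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

Accuracy well above chance (1/6 here) indicates the probed feature is linearly decodable from the representations; the paper's comparison is between accuracies under different pretraining objectives, not against chance alone.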
Problem

Research questions and friction points this paper addresses.

Examine how world-modeling objectives affect Transformer representations and downstream performance
Assess the impact of world-model quality on post-training reinforcement learning outcomes
Investigate whether sharper state representations improve sequence-planning effectiveness after post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit world-modeling objectives improve Transformer representations
State-prediction pretraining enhances linear decodability and causal steerability
Better state representations boost reinforcement learning post-training performance
Prakhar Gupta
University of Michigan
Henry Conklin
Princeton University
Sarah-Jane Leslie
Class of 1943 Professor, Philosophy & Statistics and Machine Learning, Princeton University
Andrew Lee
Harvard University