A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

📅 2026-06-02

🤖 AI Summary

This study investigates whether supervised fine-tuning (SFT) leads large language models to acquire world-model representations and reasoning capabilities for end-to-end classical planning. To this end, we devise a set of interpretability experiments for planning LLMs that combine linear probing of internal representations with generative evaluation, examining how the models encode action validity and state predicates. The experiments show that internal representations can linearly separate valid from invalid actions, even when the model's output probabilities fail to classify them accurately. They also show that broader state-space coverage in the fine-tuning data, such as from random walk data, yields more accurate recovery of the underlying world model.

📝 Abstract

Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
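As a rough illustration of what "linearly encode action validity" means in findings (a) and (b), the sketch below trains a linear probe (a logistic-regression classifier) on vectors standing in for hidden states and reports held-out accuracy. This is not the authors' code: the synthetic activations, the function names, the dimension of 16, and the training hyperparameters are all assumptions made for illustration. In a real experiment the vectors would be hidden states extracted from the fine-tuned model, labeled by whether the candidate action is valid in the current state.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for hidden states (an assumption, not real model data).

    Vectors for valid actions (label 1) are shifted along a fixed direction,
    vectors for invalid actions (label 0) are shifted the opposite way.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0, 1) + sign * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, lr=0.1, epochs=50):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * grad * x[i]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose label matches the sign of the probe's logit."""
    correct = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(data)


DIM = 16
data = make_synthetic_activations(400, DIM)
train, held_out = data[:300], data[300:]
w, b = train_linear_probe(train, DIM)
accuracy = probe_accuracy(w, b, held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out probe accuracy indicates that the property is linearly decodable from the representation. Under finding (b), such a probe can succeed on a model's hidden states even when thresholding that same model's output probabilities separates valid from invalid actions poorly.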

Problem

Research questions and friction points this paper is trying to address.

world model recovery

supervised fine-tuning

large language models

classical planning

interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model recovery

supervised fine-tuning

interpretability