Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing embodied vision-language planning models: they rely heavily on linguistic statistical priors and do not explicitly model physical causal relationships, which hinders genuine physical autonomy. The study argues for a shift from linguistically grounded token prediction to physically grounded causal reasoning. Its key contributions are Causal-Plan-Bench, a high-fidelity diagnostic benchmark covering four causal dimensions; Causal-Plan-1M, a million-scale corpus of explicit reasoning traces; and Causal Planner, a model built on Qwen3-VL-8B and trained to internalize physical logic. Leading models struggle on the benchmark, with Gemini 3 Pro reaching only 38.18. Causal Planner achieves strong in-domain performance and cross-benchmark generalization, and scaling the causal training data to one million instances yields a 36.3% relative gain (from 33.22 to 45.28).

📝 Abstract

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
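The 36.3% figure quoted above is a relative gain, not an absolute difference in score. A minimal sketch of the arithmetic, using only the two scores reported in the abstract (the function name `relative_gain` is a hypothetical helper for illustration, not something from the paper):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Scores reported in the abstract: before and after scaling causal training data.
baseline_score = 33.22
scaled_score = 45.28

absolute_gain = scaled_score - baseline_score          # difference in points
percent_gain = relative_gain(baseline_score, scaled_score)

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {percent_gain:.1f}%")
```

The absolute difference is 12.06 points; dividing by the 33.22 baseline gives the 36.3% relative gain reported.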

Problem

Research questions and friction points this paper is trying to address.

embodied planning

causal reasoning

physical grounding

token prediction

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

embodied planning

physical grounding