🤖 AI Summary
Existing vision-language-action (VLA) models struggle to internalize physical world knowledge: their video prediction often degenerates into simplistic extrapolation and fails to jointly capture instantaneous dynamics and long-term causal dependencies. This work proposes a VLA architecture with an embedded predictive world model that uses a chunk-wise autoregressive mechanism for efficient long-horizon causal forecasting. It introduces a temporal importance sampling strategy grounded in ego-motion and behavioral signals, combined with a curriculum-based progressive training paradigm. A diffusion-based multi-view renderer is also integrated to enhance visual fidelity. The resulting approach substantially improves long-term planning while preserving high-fidelity visual generation, establishing a new paradigm for knowledge-driven autonomous agents.
📝 Abstract
Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA models to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: (1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; and (2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics but cannot efficiently model long-horizon causality.
To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise autoregressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while dense intra-chunk frames capture instantaneous dynamics and sparse inter-chunk transitions capture long-term causality. A curriculum learning schedule progressively extends the prediction horizon and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, which improves visual fidelity.
Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.
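As a rough illustration of the sampling ideas described in the abstract, the following Python sketch lays out frame indices that are dense within a chunk and sparse between chunks, grows the number of predicted chunks with training progress, and draws chunks with probability proportional to an ego-motion score. It is not the authors' implementation: the abstract does not specify chunk length, inter-chunk stride, the shape of the curriculum, or the scoring function, so every function name, parameter, and the use of absolute ego acceleration as the importance signal are assumptions made for illustration only.

```python
import random
from typing import List, Sequence


def chunk_frame_indices(start: int, chunk_len: int, stride: int,
                        num_chunks: int) -> List[List[int]]:
    """Return frame indices: consecutive inside a chunk, `stride` apart between chunks.

    `chunk_len` and `stride` are hypothetical hyperparameters.
    """
    if stride < chunk_len:
        raise ValueError("stride must be >= chunk_len so chunks do not overlap")
    return [[start + c * stride + i for i in range(chunk_len)]
            for c in range(num_chunks)]


def curriculum_num_chunks(step: int, total_steps: int, max_chunks: int) -> int:
    """Assumed linear curriculum: grow the horizon from 1 chunk to `max_chunks`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1 + int(frac * (max_chunks - 1))


def importance_weights(ego_accel: Sequence[float], eps: float = 0.1) -> List[float]:
    """Normalized sampling weights that favor chunks with large ego-motion change.

    Absolute acceleration is a stand-in for the paper's ego-motion and
    behavioral signals; `eps` keeps uneventful chunks from being ignored.
    """
    raw = [abs(a) + eps for a in ego_accel]
    total = sum(raw)
    return [r / total for r in raw]


def sample_chunks(weights: Sequence[float], k: int,
                  rng: random.Random) -> List[int]:
    """Draw k chunk indices (with replacement) according to `weights`."""
    return rng.choices(range(len(weights)), weights=weights, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    n = curriculum_num_chunks(step=50, total_steps=100, max_chunks=5)
    print(chunk_frame_indices(start=0, chunk_len=4, stride=10, num_chunks=n))
    w = importance_weights([0.0, 0.2, 2.5, 0.1])  # third chunk: hard braking
    print(sample_chunks(w, k=8, rng=rng))
```

Under these assumptions, a chunk containing a hard braking event is drawn far more often than a steady-cruise chunk, while the small `eps` floor still gives uneventful chunks some supervision.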