X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models struggle to effectively internalize physical world knowledge, as their video prediction often degenerates into simplistic extrapolation and fails to jointly capture instantaneous dynamics and long-term causal dependencies. This work proposes a VLA architecture embedded with a predictive world model that leverages a chunked autoregressive mechanism for efficient long-horizon causal forecasting. It introduces a temporal importance sampling strategy grounded in egomotion and behavioral signals, combined with a curriculum-based progressive training paradigm. Furthermore, a diffusion-based multi-view renderer is integrated to enhance visual fidelity. The resulting approach substantially improves long-term planning capabilities while preserving high-fidelity visual generation, establishing a new paradigm for knowledge-driven autonomous agents.
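
As a rough illustration of the chunked forecasting and curriculum ideas mentioned above, the sketch below lays out frame indices that are dense inside each chunk and sparse between chunks, and grows the number of predicted chunks over training. The function names, the linear ramp, and all default values (`min_chunks`, `max_chunks`, `stride`) are hypothetical choices made for this example; the summary does not specify the actual schedule or chunk layout.

```python
def horizon_at(step, total_steps, min_chunks=1, max_chunks=8):
    """Hypothetical linear curriculum: how many future chunks to predict at `step`."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return min_chunks + int(round(frac * (max_chunks - min_chunks)))


def chunk_frame_indices(start, n_chunks, chunk_len, stride):
    """Frame indices per chunk: consecutive inside a chunk, `stride` frames between chunk starts."""
    return [
        [start + c * stride + i for i in range(chunk_len)]
        for c in range(n_chunks)
    ]


# Early in training only one chunk is predicted; the horizon widens later.
early = chunk_frame_indices(0, horizon_at(0, 100), chunk_len=3, stride=10)
late = chunk_frame_indices(0, horizon_at(100, 100), chunk_len=3, stride=10)
```

With a stride larger than the chunk length, consecutive chunks are separated by skipped frames, which is one simple way to make prediction targets temporally distant while keeping short-range dynamics inside each chunk.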

📝 Abstract

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables a VLA model to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges. First, unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation. Second, world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while dense intra-chunk frames preserve instantaneous dynamics and sparse inter-chunk transitions capture long-term causality. A curriculum learning schedule progressively extends the prediction horizon and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate image synthesis to a diffusion-based multi-view renderer, which improves photorealism. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.
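
The temporal importance sampling described in the abstract can be pictured with a minimal sketch: score each chunk from ego-motion and behavioral signals, normalize the scores into probabilities, and draw training chunks in proportion. Everything concrete here is an assumption made for illustration, not the paper's formulation: the score (mean absolute acceleration plus the fraction of flagged frames), the `eps` floor, and the names `chunk_importance` and `sample_chunks`.

```python
import random


def chunk_importance(ego_accel, behavior_flags, chunk_len, eps=0.05):
    """Return a sampling probability per chunk (assumed scoring rule, see lead-in)."""
    n_chunks = len(ego_accel) // chunk_len
    scores = []
    for c in range(n_chunks):
        lo, hi = c * chunk_len, (c + 1) * chunk_len
        motion = sum(abs(a) for a in ego_accel[lo:hi]) / chunk_len  # ego-motion term
        behavior = sum(behavior_flags[lo:hi]) / chunk_len           # behavioral term
        scores.append(motion + behavior + eps)  # eps keeps quiet chunks reachable
    total = sum(scores)
    return [s / total for s in scores]


def sample_chunks(probs, k, seed=0):
    """Draw k chunk indices with replacement, proportional to `probs`."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=k)


# Toy sequence of 12 frames: only the middle chunk contains hard braking and a flagged event.
ego_accel = [0.0] * 4 + [3.0] * 4 + [0.0] * 4
behavior_flags = [0] * 4 + [1] * 4 + [0] * 4
probs = chunk_importance(ego_accel, behavior_flags, chunk_len=4)
picked = sample_chunks(probs, k=16)
```

In this toy input the middle chunk receives most of the probability mass, so supervision concentrates there, while the small `eps` floor means uneventful chunks are still sampled occasionally.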

Problem

Research questions and friction points this paper is trying to address.

predictive world modeling

Vision-Language-Action models

long-horizon causality

video prediction

physical world knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

predictive world modeling

chunk-wise autoregressive forecasting

temporal importance sampling