Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

📅 2026-06-10

🤖 AI Summary

This work addresses a representation mismatch between visual future prediction and action control in World Action Models, which can generate plausible future videos yet decode inaccurate actions. Using action-head attention analysis and causal interventions, the authors show that the action decoder attends poorly to task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. To bridge this gap, they propose Action-Grounded Representation Alignment (AGRA), an objective that aligns intermediate features of the video diffusion model with spatially coherent semantic representations from a foundation visual encoder, so that the hidden states are better organized for action decoding. In real-world manipulation experiments, AGRA improves object localization accuracy, affordance understanding, and robustness to task-irrelevant perturbations, and it improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

📝 Abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to predict future scene evolution before producing control actions. However, our empirical observations reveal a failure mode: generating plausible visual futures does not guarantee that accurate actions can be extracted from them. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
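
The abstract does not give the exact form of the alignment objective, so the following is only a minimal sketch of what a patch-wise representation alignment term could look like. It assumes a linear projection of the diffusion hidden states followed by a cosine-similarity loss against the encoder's patch features. The function names (`project`, `alignment_loss`), the linear projection, and the `1 - mean cosine` form are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def project(feature: Vector, weight: Sequence[Vector]) -> List[float]:
    """Hypothetical linear projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def alignment_loss(
    hidden_patches: Sequence[Vector],
    target_patches: Sequence[Vector],
    weight: Sequence[Vector],
) -> float:
    """Assumed alignment term: 1 - mean patch-wise cosine similarity.

    hidden_patches: intermediate diffusion features, one vector per patch.
    target_patches: features from a frozen visual encoder, same patch order.
    Returns 0.0 for perfectly aligned patches and 2.0 for opposite ones.
    """
    if not hidden_patches or len(hidden_patches) != len(target_patches):
        raise ValueError("need the same non-zero number of patches on both sides")
    sims = [
        cosine_similarity(project(h, weight), t)
        for h, t in zip(hidden_patches, target_patches)
    ]
    return 1.0 - sum(sims) / len(sims)


# Usage: two patches, identity projection, targets that differ only in scale.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
target = [[2.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(hidden, target, identity)
```

Because cosine similarity ignores vector scale, the usage example yields a loss of zero up to floating-point error even though the magnitudes differ. In a training setup of this kind, such a term would typically be added to the generative loss with a weighting coefficient; whether and how AGRA does this is not stated in the abstract.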

Problem

Research questions and friction points this paper is trying to address.

World Action Models

representation mismatch

action grounding

visual reconstruction

action decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

World Action Models

Representation Alignment

Action Grounding