🤖 AI Summary
Existing evaluations focus solely on task success rates and so fail to reveal the behavioral and representational advantages of World-Action Models (WAMs) over Vision-Language-Action (VLA) models. This work proposes a model-agnostic diagnostic framework that integrates both behavioral and representational perspectives, using behavioral rollout analysis and sparse-autoencoder-based feature analysis to systematically compare these model classes in terms of goal selectivity, behavioral dynamics, and internal representational structure. Evaluating seven policies on LIBERO and RoboTwin2.0, the authors find that WAMs often enhance goal-directed behavior, but that the gains depend on architecture and come with higher inference cost. Among WAM variants, sequential WAMs exhibit the clearest predictive representations, whereas auxiliary and joint formulations compress or entangle future information, respectively.
📝 Abstract
Vision-language-action (VLA) policies and World-Action Models (WAMs) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate seven policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest directions for WAM design that preserve behaviorally actionable future representations for efficient manipulation.
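To make the behavioral protocol more concrete, the sketch below computes simple proxies for three of the quantities named in the abstract from a logged rollout. The abstract does not give the metric definitions, so everything here is an assumption made for illustration: the function names (`target_progress`, `distractor_disturbance`, `action_consistency`), the trajectory format (lists of coordinate tuples), and the formulas themselves are hypothetical and are not the authors' implementation.

```python
import math
from typing import Sequence

Point = Sequence[float]


def target_progress(target_traj: Sequence[Point], goal: Point) -> float:
    """Fraction of the initial target-to-goal distance closed by the end of the rollout.

    Assumed definition: 1.0 means the target object reached the goal,
    0.0 means no net progress, negative values mean it moved away.
    """
    start = math.dist(target_traj[0], goal)
    end = math.dist(target_traj[-1], goal)
    if start == 0.0:
        return 1.0  # the object started at the goal
    return (start - end) / start


def distractor_disturbance(distractor_trajs: Sequence[Sequence[Point]]) -> float:
    """Mean path length travelled by distractor objects (0.0 means untouched).

    Assumed definition: sum of step-to-step displacements per distractor,
    averaged over all distractors.
    """
    if not distractor_trajs:
        return 0.0
    lengths = [
        sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))
        for traj in distractor_trajs
    ]
    return sum(lengths) / len(lengths)


def action_consistency(actions: Sequence[Point]) -> float:
    """Mean L2 norm of consecutive action differences (lower means smoother).

    This is only a stand-in for "action dynamics consistency".
    """
    if len(actions) < 2:
        return 0.0
    diffs = [math.dist(a, b) for a, b in zip(actions, actions[1:])]
    return sum(diffs) / len(diffs)


# Toy rollout: the target moves from x=0 to x=3 toward a goal at x=4,
# one distractor stays put and another is nudged by one unit.
progress = target_progress([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], goal=(4.0, 0.0))
disturbance = distractor_disturbance(
    [[(0.0, 0.0), (0.0, 0.0)], [(1.0, 1.0), (1.0, 2.0)]]
)
consistency = action_consistency([(0.0, 0.0), (1.0, 0.0), (1.0, 0.0)])
```

Under these assumed definitions, two policies with the same final success rate can still differ in how much they disturb distractors or how smoothly they act, which is the kind of difference the abstract says success alone hides.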