WALL-WM: Carving World Action Modeling at the Event Joints

📅 2026-06-01

🤖 AI Summary
Existing world action models (WAMs) optimize fixed-length action chunks, which forces language, vision, and action into a single prediction window despite their different semantic and temporal granularity. This work proposes an event-centric paradigm that treats semantically coherent action events as the atomic unit of learning and introduces an event-grounded vision-language-action (VLA) pretraining framework. From one event-pretrained backbone, it supports an event mode with variable-length execution chunks and a unified mode that uses a VLM with Staircase Decoding for fixed-length chunk inference while keeping a gradient-continuous VLA path. Training combines event-level captions, cluster-balanced sampling, and Muon-optimizer-based large-scale pretraining infrastructure. In large-scale real-world generalization evaluation, the authors report state-of-the-art performance and broad generalization across language, scenes, and tasks.
📝 Abstract
WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
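The contrast between fixed-length chunks and event-level units, and the idea of cluster-balanced sampling, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how events are segmented or how clusters are formed, so the per-step event labels, the integer cluster ids, and the helper names `fixed_chunks`, `event_segments`, and `cluster_balanced_weights` are all hypothetical and used only for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def fixed_chunks(num_steps: int, chunk_len: int) -> List[Tuple[int, int]]:
    """Split a trajectory into fixed-length [start, end) windows."""
    return [(s, min(s + chunk_len, num_steps)) for s in range(0, num_steps, chunk_len)]


def event_segments(event_labels: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Group consecutive steps that share a label into [start, end) segments.

    Assumes each control step already carries an event label (hypothetical input).
    """
    segments = []
    start = 0
    for t in range(1, len(event_labels) + 1):
        # Close the current segment at the end of the trajectory or on a label change.
        if t == len(event_labels) or event_labels[t] != event_labels[start]:
            segments.append((start, t, event_labels[start]))
            start = t
    return segments


def cluster_balanced_weights(cluster_ids: Sequence[int]) -> List[float]:
    """Per-sample probabilities that give every cluster the same total mass.

    Assumes each event has already been assigned a cluster id (hypothetical input).
    """
    counts = Counter(cluster_ids)
    num_clusters = len(counts)
    return [1.0 / (num_clusters * counts[c]) for c in cluster_ids]


if __name__ == "__main__":
    labels = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 4
    print(fixed_chunks(len(labels), 4))    # windows cut across event boundaries
    print(event_segments(labels))          # variable-length, one segment per event
    print(cluster_balanced_weights([0, 0, 0, 1]))
```

In this toy trajectory the fixed windows of length 4 straddle the reach/grasp and grasp/lift boundaries, while the event segments have lengths 3, 2, and 4 and each correspond to a single label. The weights could be passed to any weighted sampler so that a rare cluster is drawn as often, in aggregate, as a common one.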
Problem

Research questions and friction points this paper is trying to address.

World Action Model
granularity mismatch
Vision-Language-Action
semantic events
action chunk
Innovation

Methods, ideas, or system contributions that make the work stand out.

event-grounded learning
Vision-Language-Action (VLA)
semantic events
Staircase Decoding
world action modeling