World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of unifying world modeling, language reasoning, and action generation to enable robots to perform complex, long-horizon tasks. The authors propose WLA-0, a framework built upon an autoregressive Transformer backbone that employs meta-queries to decouple and coordinate a World Expert and an Action Expert. This architecture jointly processes textual instructions, visual inputs, and state observations to simultaneously predict subtasks, subgoal images, and actions. A key innovation is an implicit world-prediction mechanism that influences action generation yet allows the world modeling module to be disabled during inference. The model further leverages cross-embodiment videos without action annotations for learning new tasks. Evaluated on RoboTwin2.0 Clean and RMBench, WLA-0 achieves success rates of 92.94% and 56.5%, respectively, with its 2B activated-parameter variant running at 40 ms per inference step on an RTX 5090, demonstrating strong multitask and long-horizon decision-making capabilities.

📝 Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.
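
The abstract says the world prediction influences action generation only implicitly, so the World Expert can be switched off at inference. The toy sketch below illustrates one way such a layout could behave. It is not the authors' implementation. It assumes that both experts read the same shared features (standing in for the meta-query outputs) and that the action head never consumes the World Expert's output. All names (`ToyWLA`, `step`, `use_world_expert`) and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


def random_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Small random weight matrix; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights: List[Vector], x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


@dataclass
class StepOutput:
    action: Vector
    subgoal_latent: Optional[Vector]  # None when the world branch is skipped


class ToyWLA:
    """Hypothetical layout: shared backbone features feed two separate heads."""

    def __init__(self, obs_dim=8, hidden_dim=16, action_dim=4, latent_dim=6, seed=0):
        rng = random.Random(seed)
        self.backbone = random_matrix(hidden_dim, obs_dim, rng)
        self.action_expert = random_matrix(action_dim, hidden_dim, rng)
        self.world_expert = random_matrix(latent_dim, hidden_dim, rng)

    def step(self, observation: Vector, use_world_expert: bool = False) -> StepOutput:
        # Shared features (ReLU of a linear map), standing in for meta-query outputs.
        shared = [max(0.0, h) for h in linear(self.backbone, observation)]
        # The action head depends only on the shared features (an assumption).
        action = linear(self.action_expert, shared)
        # The world head is optional, so skipping it saves its compute.
        subgoal = linear(self.world_expert, shared) if use_world_expert else None
        return StepOutput(action=action, subgoal_latent=subgoal)


model = ToyWLA()
observation = [0.1] * 8
fast = model.step(observation)                         # world branch disabled
full = model.step(observation, use_world_expert=True)  # world branch enabled
print(fast.subgoal_latent is None, fast.action == full.action)
```

In this sketch the action is identical with or without the world branch, which is what makes the branch safe to drop. The sketch does not model the test-time scaling mode the abstract mentions, where activating the world prediction improves control.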

Problem

Research questions and friction points this paper is trying to address.

world modeling

language reasoning

action synthesis

embodied AI

long-horizon tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

world-language-action model

autoregressive transformer

world modeling