Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

📅 2026-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high training cost and inference latency of existing World Action Models, whose reliance on large-scale generative architectures hinders their deployment as efficient closed-loop policies. To overcome these limitations, the authors propose Light-WAM, a lightweight World Action Model with a compact video backbone that is supervised on future video in a downsampled latent space. For action prediction, a StateFusionActionExpert module reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and predicts action chunks directly in a single forward pass, avoiding the need for a heavyweight generative action expert. With only 0.44 billion trainable parameters, the model maintains strong performance on the LIBERO benchmark and achieves usable multi-task performance on RoboTwin 2.0. It reaches an inference latency of 72.03 ms with a peak GPU memory consumption of 4.1 GiB, and it also improves training throughput.
📝 Abstract
World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03 ms inference latency with 4.1 GiB peak GPU memory and improved training throughput.
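The abstract describes the action decoder only at a high level: adapted states from several backbone layers, fused by learned-query pooling, then mapped to an action chunk in one forward pass. The following is a minimal NumPy sketch of what such a decoder could look like, not the authors' implementation. The class name `StateFusionSketch`, the linear per-layer adapters, the use of one learned query per action step, the single-head attention pooling, and all dimensions are assumptions made for illustration; weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action decoder.

    All design details here are assumptions, not taken from the paper.
    """

    def __init__(self, num_layers, d_model, d_fused, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        # Assumed: one linear adapter per tapped backbone layer.
        self.adapters = [
            rng.normal(0.0, scale, (d_model, d_fused)) for _ in range(num_layers)
        ]
        # Assumed: one learned query per action step in the chunk.
        self.queries = rng.normal(0.0, scale, (chunk_len, d_fused))
        # Assumed: a linear head maps each pooled vector to one action.
        self.head = rng.normal(0.0, scale, (d_fused, action_dim))
        self.d_fused = d_fused

    def __call__(self, layer_states):
        # layer_states: list of arrays, each of shape (batch, tokens, d_model).
        adapted = [h @ w for h, w in zip(layer_states, self.adapters)]
        # Concatenate adapted states from all layers along the token axis.
        memory = np.concatenate(adapted, axis=1)  # (batch, layers*tokens, d_fused)
        # Each query attends over all adapted states (single-head attention).
        scores = np.einsum("qd,bnd->bqn", self.queries, memory)
        weights = softmax(scores / np.sqrt(self.d_fused), axis=-1)
        pooled = np.einsum("bqn,bnd->bqd", weights, memory)
        # One action per query, produced in a single forward pass.
        return pooled @ self.head  # (batch, chunk_len, action_dim)


# Usage with random stand-ins for three backbone layers.
rng = np.random.default_rng(1)
states = [rng.normal(size=(2, 16, 64)) for _ in range(3)]
expert = StateFusionSketch(
    num_layers=3, d_model=64, d_fused=32, chunk_len=8, action_dim=7
)
actions = expert(states)
print(actions.shape)  # (2, 8, 7)
```

Because the whole chunk comes out of one deterministic pass, a decoder of this shape needs no iterative sampling at inference time, which is consistent with the efficiency argument the abstract makes against heavy generative action experts.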
Problem

Research questions and friction points this paper is trying to address.

World Action Models
efficient robot policy
high training cost
inference latency
closed-loop deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

World Action Models
State-Fusion Action Decoding
Lightweight Robotics
Latent Video Prediction
Efficient Policy Learning
Ziang Li
Wuhan University
Dongzhou Cheng
Shanghai Innovation Institute
Yibin Wang
Intern at UIUC
Trustworthy AI
Shiyue Wang
East China Normal University
Xiaoyang Xu
New Jersey Institute of Technology
biomaterials, nanomedicine, drug delivery, nanotechnology, tissue engineering
Lingxuan Weng
East China Normal University
Juan Wang
Wuhan University
Jiaqi Wang
Shanghai AI Laboratory
Computer Vision, Multi-modal Learning