Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

📅 2026-06-08
🤖 AI Summary
This work addresses the high computational cost and inference latency of existing world-action models, which rely on high-fidelity future video prediction and thus hinder real-time deployment. To overcome this limitation, the authors propose a lightweight world-action model that treats future video prediction as a guiding signal rather than a reconstruction target. The approach leverages a compact video expert adapted from WAN-2.2-5B, employs token-sparse latent video representations, and introduces an asymmetric video-action denoising mechanism to substantially reduce computational overhead. Evaluated on RoboTwin 2.0 and real-world manipulation tasks, the method maintains strong control performance while achieving an inference latency of approximately 100 milliseconds on a single GPU, a 30-fold speedup over current world-action models (WAMs).
๐Ÿ“ Abstract
World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.
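The asymmetric video-action denoising described in the abstract can be illustrated with a minimal sketch. It only models the step budget: the video branch gets fewer sampling steps than the action branch. Everything else is an assumption made for illustration and is not taken from the paper: the function names, the video-first ordering, the evenly spaced timesteps, and the per-step cost units are all hypothetical.

```python
from typing import List, Tuple

Update = Tuple[str, float, float]  # (branch, t_from, t_to)


def timesteps(num_steps: int) -> List[float]:
    """Evenly spaced times from 1.0 (pure noise) down to 0.0 (clean)."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def asymmetric_plan(video_steps: int, action_steps: int) -> List[Update]:
    """List the denoising updates for a video branch and an action branch.

    Assumption (not from the paper): the coarse video pass runs first,
    then the action branch is refined with its larger step budget.
    """
    if not 0 < video_steps <= action_steps:
        raise ValueError("expected 0 < video_steps <= action_steps")
    plan: List[Update] = []
    for branch, n in (("video", video_steps), ("action", action_steps)):
        ts = timesteps(n)
        plan += [(branch, ts[i], ts[i + 1]) for i in range(n)]
    return plan


def relative_cost(plan: List[Update], video_cost: float, action_cost: float) -> float:
    """Sum hypothetical per-step costs over all updates in the plan."""
    return sum(video_cost if b == "video" else action_cost for b, _, _ in plan)


if __name__ == "__main__":
    # Hypothetical cost units: one video step is 10x as expensive as one action step.
    symmetric = asymmetric_plan(video_steps=10, action_steps=10)
    asymmetric = asymmetric_plan(video_steps=2, action_steps=10)
    print(relative_cost(symmetric, 10.0, 1.0))
    print(relative_cost(asymmetric, 10.0, 1.0))
```

Under these made-up cost units, cutting the video branch from 10 steps to 2 removes most of the total sampling cost while leaving the action branch's step budget unchanged. The actual step counts and costs used by Efficient-WAM are not stated in the abstract.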
Problem

Research questions and friction points this paper is trying to address.

World-Action Models
future visual prediction
inference latency
real-time robot deployment
embodied control
Innovation

Methods, ideas, or system contributions that make the work stand out.

World-Action Model
efficient inference
future imagination
asymmetric denoising
video-action generation
🔎 Similar Papers
Jiajun Li
The University of Hong Kong
Tiecheng Guo
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yifan Ye
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Rongyu Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiaowei Chi
The Hong Kong University of Science and Technology
Multimodal Generation, Robotics, Computer Vision
Qianpu Sun
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ying Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yunfan Lou
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yan Huang
Institute of Automation, Chinese Academy of Sciences
computer vision, deep learning, multimodal learning
Zhihe Lu
HBKU<--NUS<--University of Surrey<--CASIA
Computer Vision, Transfer Learning, Few-shot Learning, Multimodal Learning, Continual Learning
Meng Guo
Peking University
Task and Motion Planning, Multi-robot Systems, Robotic Manipulation
Shanghang Zhang
Peking University
Embodied AI, Foundation Models