Latent Action Pretraining Through World Modeling

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Vision-Language-Action (VLA) models rely heavily on large-scale, manually annotated action data and massive parameter counts, resulting in high deployment costs. Moreover, prevailing action representation methods suffer from either annotation dependency or inadequate modeling of inter-frame dynamics, limiting generalizability and practicality. To address these issues, we propose LAWM, a model-agnostic, self-supervised latent action pretraining framework. LAWM leverages a world model to capture visual dynamics directly from unlabeled robot- or human-operated videos, learning compact, transferable latent action representations. These representations enable transfer across tasks, environments, and embodiments. Evaluated on the LIBERO benchmark and in real-robot experiments, LAWM significantly outperforms both supervised VLA models and comparable pretraining approaches, achieving state-of-the-art performance while substantially improving deployment efficiency.

📝 Abstract
Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $\pi_0$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way, by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is designed to be effective for transferring across tasks, environments, and embodiments. It outperforms models trained with ground-truth robotics actions and similar pretraining methods on the LIBERO benchmark and real-world setup, while being significantly more efficient and practical for real-world settings.
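
The abstract describes the core mechanism only at a high level: a latent action is inferred from pairs of consecutive video frames, and a world model is trained to predict the next frame from the current frame and that latent, so the latent space is forced to capture inter-frame dynamics without any action labels. Below is a minimal, hypothetical PyTorch sketch of that idea; all module names, architectures, and dimensions are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of latent-action pretraining via world modeling.
# Assumption: architectures, dimensions, and losses are illustrative, not from the paper.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Infers a compact latent action z_t from two consecutive frames."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        # Concatenate the frame pair along channels so the encoder sees the change.
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class WorldModel(nn.Module):
    """Predicts the next frame from the current frame and a latent action."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.film = nn.Linear(latent_dim, 128)  # scale/shift conditioning on z_t
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame_t, z):
        h = self.encode(frame_t)
        scale, shift = self.film(z).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decode(h)

def pretrain_step(encoder, world_model, frame_t, frame_t1, optimizer):
    """One self-supervised step on an unlabeled frame pair: no action labels used."""
    z = encoder(frame_t, frame_t1)            # latent action between the two frames
    pred_t1 = world_model(frame_t, z)         # world model rolls the scene forward
    loss = nn.functional.mse_loss(pred_t1, frame_t1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the world model can only reach the next frame through the low-dimensional latent, reconstruction pressure pushes that latent to encode the visual change between frames, which is what makes it usable as a pseudo action label downstream.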
Problem

Research questions and friction points this paper is trying to address.

Pretraining imitation learning models without labeled action data
Enabling efficient deployment of large vision-language-action models
Transferring learned skills across tasks, environments, and robot embodiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pretraining through world modeling
Learning latent actions from unlabeled video data
Model-agnostic framework for efficient transfer learning
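
To make the model-agnostic transfer story in the bullets above concrete, here is a hypothetical two-stage sketch of how a pretrained latent-action space could be plugged into an imitation-learning pipeline. The stages, class names, and dimensions are assumptions for illustration, not the paper's exact recipe: the policy first learns to predict latent actions as pseudo-labels on unlabeled video, and a small decoder is then fine-tuned on limited labeled data to map latents into a specific robot's action space.

```python
# Hypothetical downstream use of a pretrained latent-action space.
# Assumption: this two-stage recipe is illustrative; the paper's procedure may differ.
import torch
import torch.nn as nn

class LatentActionPolicy(nn.Module):
    """Wraps any imitation-learning backbone so it outputs latent actions."""
    def __init__(self, backbone, latent_dim=16):
        super().__init__()
        self.backbone = backbone                      # hypothetical: any observation encoder
        self.latent_head = nn.LazyLinear(latent_dim)  # predicts the latent action z_t

    def forward(self, obs):
        return self.latent_head(self.backbone(obs))

class ActionDecoder(nn.Module):
    """Small head mapping latent actions to one robot's real action space."""
    def __init__(self, latent_dim=16, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, z):
        return self.net(z)

def pretrain_policy_loss(policy, latent_encoder, frame_t, frame_t1):
    """Stage 1: regress the policy onto latent actions extracted from unlabeled video."""
    with torch.no_grad():
        z_target = latent_encoder(frame_t, frame_t1)  # pseudo action labels, no teleoperation
    return nn.functional.mse_loss(policy(frame_t), z_target)

def finetune_loss(policy, decoder, obs, true_action):
    """Stage 2: a small labeled set maps the shared latent space to a specific embodiment."""
    return nn.functional.mse_loss(decoder(policy(obs)), true_action)
```

Keeping the embodiment-specific decoder small is what would let the same pretrained latent space be reused across tasks, environments, and robots, since only the final mapping changes per platform.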
Bahey Tharwat
MBZUAI
Computer Vision, Embodied AI
Yara Nasser
Alexandria University, Alexandria, Egypt
Ali Abouzeid
MSc @MBZUAI, B.Eng @UTM JB
Robot Learning, 3D Vision
Ian Reid
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE