AI Summary
This work addresses the limitation of existing world action models, which predominantly rely on pixel-reconstruction-based video tokenizers and struggle to learn action dynamics aligned with language instructions for robotic control. To overcome this, the authors propose a representation-centric world action model featuring a semantic vision-action joint tokenizer that maps visual inputs into aligned latent visual and action tokens. The model jointly predicts future visual states and the intervening implicit actions conditioned on natural language instructions. Through language-conditioned pretraining followed by fine-tuning on real robot trajectories, the approach significantly outperforms baseline methods in both simulated and real-world manipulation tasks. These results demonstrate the effectiveness of the semantic tokenization mechanism in enhancing instruction following, generalization, and closed-loop control performance.
Abstract
This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.