AI Summary
This work addresses the challenges of cross-environment continual dynamics modeling and zero-shot action adaptation. We propose WLA, the first continuous latent action representation framework grounded in Lie group theory. WLA abandons discrete action assumptions and instead models action dynamics on Lie group manifolds, achieving semantic disentanglement of actions and joint representation learning across environments. Combined with an object-centric autoencoder and unsupervised continuous action learning, it requires only raw video frames for training: no action labels or environment-specific supervision. Evaluated on both synthetic and real-world datasets, WLA demonstrates significantly improved cross-environment generalization, enables rapid adaptation to unseen environments and novel action classes, and reduces reliance on action labels to near zero. To our knowledge, WLA is the first method to unify high controllability, strong predictive accuracy, and robust transferability in a single continuous action representation framework.
Abstract
Many world models being developed today are autoregressive frameworks that rely on discrete representations of actions and observations, and they have succeeded in constructing interactive generative models for a single target environment of interest. Humans, by contrast, demonstrate a remarkable ability to generalize: they combine experiences from multiple environments to mentally simulate, and learn to control, agents in diverse settings. Inspired by this capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations that transfer across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and an object-centric autoencoder. On synthetic benchmarks and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.
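To make the Lie-group view of continuous actions concrete, below is a minimal toy sketch, not the WLA implementation. It uses SO(2), the simplest Lie group, as a stand-in for the learned action manifold: a latent action is a scalar in the Lie algebra so(2), the exponential map turns it into a group element (a rotation matrix) that acts on a latent state, and composing two actions reduces to addition in the algebra. The names `exp_so2` and `apply_action` are hypothetical and chosen for illustration only.

```python
import numpy as np

def exp_so2(theta: float) -> np.ndarray:
    """Exponential map from so(2) (a scalar) to SO(2) (a 2x2 rotation matrix)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def apply_action(z: np.ndarray, a: float) -> np.ndarray:
    """Act on a 2-D latent state z with the group element exp(a)."""
    return exp_so2(a) @ z

# A continuous action is a point in the algebra, not a discrete token:
z = np.array([1.0, 0.0])
z_two_steps = apply_action(apply_action(z, 0.3), 0.2)  # two small actions
z_one_step  = apply_action(z, 0.5)                     # one combined action

# For this (abelian) group, composing actions is addition in the algebra.
assert np.allclose(z_two_steps, z_one_step)
```

In WLA the group and its action on the latent space are learned from video rather than fixed to rotations, but the same structure applies: actions live on a smooth manifold, so nearby actions produce nearby transitions, which is what enables interpolation to unseen action values.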