🤖 AI Summary
Existing world action models (WAMs) struggle with long-horizon, complex tasks because they rely heavily on video prediction as an action prior and lack adaptive multimodal reasoning, which makes it hard to combine high-level planning with fine-grained control. This work proposes AdaWAM, a world action model with a lightweight, execution-context-aware dynamic router that lets the model switch autonomously between textual reasoning, used at task transitions, and visual reasoning, used for precise manipulation, according to the demands of the current step. In experiments on both simulated and real-world embodied tasks, AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies.
📝 Abstract
World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long-horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose AdaWAM, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Code and demos are available at: https://adawam.github.io/.
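The routing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the router's inputs or architecture, so the `ExecutionContext` fields, the fixed threshold, and the `NONE` mode (skipping extra reasoning, which is one plausible source of the efficiency gain) are all assumptions made for illustration. A learned router would replace the hand-written rule.

```python
from dataclasses import dataclass
from enum import Enum


class ReasoningMode(Enum):
    NONE = "none"      # assumed option: act directly without extra reasoning
    TEXT = "text"      # textual reasoning, e.g. at task transitions
    VISUAL = "visual"  # visual reasoning, e.g. during fine-grained manipulation


@dataclass
class ExecutionContext:
    # Hypothetical scalar signals in [0, 1]; the real router inputs are not given.
    transition_score: float  # how likely the agent is switching sub-tasks
    precision_score: float   # how much fine-grained control is currently needed


def route(ctx: ExecutionContext, threshold: float = 0.5) -> ReasoningMode:
    """Pick a reasoning mode; thresholding stands in for a learned router."""
    if max(ctx.transition_score, ctx.precision_score) < threshold:
        return ReasoningMode.NONE
    if ctx.transition_score >= ctx.precision_score:
        return ReasoningMode.TEXT
    return ReasoningMode.VISUAL


if __name__ == "__main__":
    steps = [
        ExecutionContext(transition_score=0.9, precision_score=0.1),
        ExecutionContext(transition_score=0.2, precision_score=0.8),
        ExecutionContext(transition_score=0.1, precision_score=0.2),
    ]
    for ctx in steps:
        print(route(ctx).value)
```

Under these assumptions, the three example steps select textual reasoning, visual reasoning, and no extra reasoning respectively, mirroring the abstract's point that different execution contexts call for different reasoning modes.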