AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

2026-06-08
Citations: 0
Influential: 0
AI Summary
Existing world-action models struggle to achieve long-horizon scene understanding and real-time control at the same time, because they couple perception and action at the same temporal resolution. This work proposes an asynchronous, horizon-adaptive world-action modeling framework that decouples the temporal rhythms of perception and action through a dual Diffusion Transformer architecture. The approach combines Observation-Guided Video-Context Routing (OVCR), rolling key-value memory, and layerwise joint attention. The method achieves a 92.80% average success rate on RoboTwin and 78.3% success across four real-world tasks. It runs closed-loop control at 24.17 Hz, a 4.59× speedup over Fast-WAM, without any pretraining on robot data.
Abstract
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
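The two-rate execution pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `WorldPlanner`, `ActionExpert`, `run_episode`, `replan_every`, and the scalar observations are all hypothetical stand-ins, and OVCR and horizon-adaptive offset training are not modeled. The sketch only shows a slow planner that refreshes a cached context occasionally while a fast action loop reuses that context at every step together with the current observation.

```python
from collections import deque


class WorldPlanner:
    """Hypothetical stand-in for the low-frequency video DiT (not the authors' model)."""

    def __init__(self, memory_size=4):
        # Rolling memory: a bounded queue that drops the oldest observation.
        self.memory = deque(maxlen=memory_size)
        self.calls = 0

    def plan(self, observation):
        self.calls += 1
        self.memory.append(observation)
        # Placeholder "context": the mean of the remembered observations.
        summary = sum(self.memory) / len(self.memory)
        return {"made_at": observation, "summary": summary}


class ActionExpert:
    """Hypothetical stand-in for the high-frequency action DiT."""

    def __init__(self):
        self.calls = 0

    def act(self, observation, context, chunk_len=2):
        self.calls += 1
        # Toy rule: combine the cached context with the current observation.
        return [context["summary"] - observation] * chunk_len


def run_episode(num_steps=12, replan_every=4):
    planner, expert = WorldPlanner(), ActionExpert()
    context, log = None, []
    for step in range(num_steps):
        observation = float(step)  # toy scalar observation
        if step % replan_every == 0:
            context = planner.plan(observation)  # slow path, run rarely
        chunk = expert.act(observation, context)  # fast path, run every step
        log.append((step, context["made_at"], chunk[0]))
    return planner.calls, expert.calls, log


planner_calls, expert_calls, log = run_episode()
print(planner_calls, expert_calls)
```

With the default arguments, the planner runs 3 times while the action expert runs 12 times, so most control steps reuse a context computed at an earlier step and only the cheap action path sees the latest observation.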
Problem

Research questions and friction points this paper is trying to address.

world-action modeling
temporal resolution
asynchronous execution
robot manipulation
video context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous World-Action Modeling
Horizon-Adaptive Planning
Observation-Guided Context Routing
Dual Diffusion Transformer
Embodied Control
Jisong Cai
Shanghai Jiao Tong University; Shanghai AI Laboratory
Long Ling
Tongji University
Human AI Interaction, HCI, Digital Fabrication
Shiwei Chu
Shanghai Jiao Tong University
Zhongshan Liu
Baidu AI Cloud
Jiayue Kang
Shanghai Jiao Tong University
Zhixuan Liang
University of Hong Kong
Embodied AI, Machine Learning, Robotics, Computer Vision
Wenjie Xu
PhD Student, Wuhan University
Knowledge Graph, NLP
Yinan Mao
Baidu AI Cloud
Weinan Zhang
Professor, Shanghai Jiao Tong University
Reinforcement Learning, Agents, Data Science
Xiaokang Yang
Shanghai Jiao Tong University
Ru Ying
Baidu AI Cloud
Ran Zheng
Baidu AI Cloud
Yao Mu
Shanghai Jiao Tong University; Shanghai AI Laboratory