$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations: existing embodied policies do not explicitly evaluate the consequences of candidate actions, and conventional world models are often restricted to passive prediction. To bridge this gap, the authors propose ω-EVA, a framework that closes the loop between policy and world model through a three-stage Envision–Verify–Act mechanism. Operating entirely in latent space, ω-EVA reasons about action outcomes without generating future videos at inference. The approach integrates an action-conditioned latent dynamics model, a language-conditioned flow policy, and a dynamics-aware visual representation, and adds a tri-branch refiner that jointly reasons over the current state, the proposed action, and the proposal-conditioned future to produce the final action chunk. Experiments across single-arm, bimanual, long-horizon, and perturbed simulation settings show that the full interaction pipeline consistently improves the proposal policy. With approximately 1.2B parameters and no additional robot-data pretraining, ω-EVA offers a compact and competitive trade-off among performance, model scale, and data efficiency.

📝 Abstract

Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision–Verify–Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance–scale–data trade-off, making the world model an active action-feedback module rather than a passive predictor.
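
The inference-time data flow described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: the function names (`encode`, `propose`, `predict_future`, `refine`, `act`), the list-of-floats latents, and every arithmetic rule inside them are hypothetical stand-ins for the learned visual encoder, flow policy, latent dynamics model, and tri-branch refiner. The only thing it is meant to show is the order of operations: the policy proposes an action, the world model predicts the latent future conditioned on that proposal, and the refiner sees state, future, and proposal together before the final action is emitted.

```python
from typing import List

Vec = List[float]  # toy latent / action vector (assumption: real latents are learned features)


def encode(observation: Vec) -> Vec:
    """Hypothetical stand-in for the dynamics-aware visual encoder."""
    return [0.5 * x for x in observation]


def propose(state: Vec, instruction: Vec) -> Vec:
    """Hypothetical stand-in for the language-conditioned flow policy.

    Toy rule: move from the current latent state toward the instruction vector.
    """
    return [g - s for s, g in zip(state, instruction)]


def predict_future(state: Vec, action: Vec) -> Vec:
    """Hypothetical stand-in for the action-conditioned latent dynamics model.

    It returns a future latent, never pixels, so no video is generated.
    """
    return [s + 0.5 * a for s, a in zip(state, action)]


def refine(state: Vec, future: Vec, action: Vec) -> Vec:
    """Hypothetical stand-in for the tri-branch refiner.

    Toy rule: blend the proposal with the latent displacement it is predicted to cause.
    """
    return [0.5 * a + 0.5 * (f - s) for s, f, a in zip(state, future, action)]


def act(observation: Vec, instruction: Vec) -> Vec:
    """Run one propose -> predict -> refine pass and return the final action."""
    state = encode(observation)
    proposal = propose(state, instruction)       # policy's own proposal
    future = predict_future(state, proposal)     # consequence of that proposal, in latent space
    return refine(state, future, proposal)       # final action after inspecting the consequence


if __name__ == "__main__":
    print(act([2.0, 4.0], [3.0, 1.0]))
```

In this toy version the refined action differs from the raw proposal because the refiner takes the predicted consequence into account, which is the structural point of the loop; the specific numbers carry no meaning.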

Problem

Research questions and friction points this paper is trying to address.

embodied intelligence

world models

action consequences

latent dynamics

interactive reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent interactive world model

Envision-Verify-Act loop

action-conditioned dynamics