🤖 AI Summary
This work addresses the limited flexibility and controllability of interactive world modeling by proposing AnchorWorld, a framework that uses 3D human motion as the primary interaction modality. It strengthens egocentric spatial grounding through auxiliary training supervision from external viewpoints, and it customizes self-evolving worlds by defining anchor views in a unified world coordinate system, each paired with a textual description that dictates how the local scene evolves. The method outperforms state-of-the-art baselines, and its customization scheme shows promising spatiotemporal geometric consistency and close adherence to the text-specified scene evolution.
📝 Abstract
Although interactive world modeling is a pivotal frontier, the versatile controllability required by practical scenarios remains underexplored. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we use 3D human motion as the primary interaction modality. To compensate for body parts that are out of view or truncated in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. This supervision lets the model observe the agent's full-body position relative to the environment, facilitating more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds: anchor views are defined within a unified world coordinate system and coupled with textual descriptions that dictate the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, and ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatiotemporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.