Accurate and Efficient World Modeling with Masked Latent Transformers

📅 2025-07-05
🤖 AI Summary
Dreamer-style algorithms often discard critical dynamic information during latent-space compression, degrading agent performance. High-fidelity approaches such as Δ-IRIS and DIAMOND avoid this loss but require training agents at the pixel level, compromising computational efficiency and representation reusability. To address this trade-off, we propose EMERALD, the first world model framework integrating a Masked Latent Transformer (MLT) into the Dreamer architecture, enabling autoregressive trajectory modeling and high-fidelity reconstruction directly in latent space. EMERALD introduces a latent-space masking mechanism that predicts masked latent tokens without accessing raw pixels, substantially improving trajectory simulation quality and state representation fidelity. On the Crafter benchmark, EMERALD achieves state-of-the-art performance, becomes the first method to surpass human expert performance within 10 million environment steps, and unlocks all 22 achievements at least once during evaluation.

📝 Abstract
The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as $\Delta$-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model that uses a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.
Problem

Research questions and friction points this paper is trying to address.

Improving world model accuracy and efficiency
Addressing information loss in latent space
Enhancing agent performance with latent trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

EMERALD uses a spatial latent state
MaskGIT predictions generate accurate trajectories in latent space
Surpasses human expert performance on the Crafter benchmark within 10M environment steps
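
To make the "MaskGIT predictions" item above more concrete, here is a minimal sketch of generic MaskGIT-style iterative decoding over a flat sequence of discrete latent tokens. It is not EMERALD's implementation: the `toy_predictor` function is a hypothetical stand-in for a trained transformer, the names `maskgit_decode` and `MASK` are illustrative, and the cosine masking schedule and confidence-based selection follow the general MaskGIT recipe in simplified form (no temperature or noise on the confidences).

```python
import math
import random

MASK = -1  # sentinel marking a latent position that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for the transformer.

    Returns one probability distribution over the vocabulary per position.
    A real model would condition on the already decoded tokens; this toy
    version ignores them and returns normalized random weights.
    """
    probs = []
    for _ in tokens:
        weights = [rng.random() + 1e-6 for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def maskgit_decode(num_tokens, vocab_size, steps, predictor=toy_predictor, seed=0):
    """Decode all tokens in a few parallel steps instead of one token at a time."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # start from a fully masked latent state
    for step in range(steps):
        probs = predictor(tokens, vocab_size, rng)

        # Sample a candidate for every masked position; its probability
        # serves as the confidence score.
        candidates = {}
        for i, tok in enumerate(tokens):
            if tok == MASK:
                choice = rng.choices(range(vocab_size), weights=probs[i])[0]
                candidates[i] = (choice, probs[i][choice])
        if not candidates:
            break  # everything was decoded ahead of schedule

        # Cosine schedule: how many positions stay masked after this step.
        if step == steps - 1:
            keep_masked = 0  # the final step commits every remaining token
        else:
            ratio = math.cos(math.pi / 2 * (step + 1) / steps)
            keep_masked = math.floor(num_tokens * ratio)
        num_to_fill = max(1, len(candidates) - keep_masked)

        # Commit the most confident candidates; the rest stay masked.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (choice, _conf) in ranked[:num_to_fill]:
            tokens[i] = choice
    return tokens
```

Calling `maskgit_decode(num_tokens=16, vocab_size=8, steps=4)` fills the 16 positions in four parallel passes, committing only a few high-confidence tokens early and the remainder in the last pass. In a world model with a spatial latent state, the positions would correspond to cells of the latent grid and the predictor would additionally condition on past states and actions; those details are assumptions here, not taken from the paper summary.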