🤖 AI Summary
This work addresses the limitations of existing video world models that rely on explicit point cloud memory built in RGB space: high computational overhead from repeated rendering and VAE encoding, and information loss from the round trip through pixel space. The authors propose a persistent 3D cache in diffusion latent space. Latent tokens are lifted into 3D via depth-guided back-projection, and subsequent querying and novel-view synthesis are performed entirely in latent space through latent warping. The approach is described as the first to maintain full 3D spatial consistency purely in latent space. Relative to explicit 3D baselines, the method achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint. Leveraging the geometric prior of the diffusion model, it attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
📝 Abstract
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
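The two operations named in the abstract, lifting latent tokens into 3D with depth and querying the cache by warping into a new view, can be illustrated with a minimal pinhole-camera sketch. This is not Mirage's implementation: the names `Camera`, `lift_latents`, and `warp_to_view` are hypothetical, the intrinsics are assumed to be expressed at latent-grid resolution, and the nearest-depth splatting used to resolve collisions is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    # Hypothetical pinhole camera at latent-grid resolution (an assumption).
    fx: float
    fy: float
    cx: float
    cy: float
    R: list  # 3x3 camera-to-world rotation
    t: list  # camera-to-world translation


def lift_latents(latents, depth, cam):
    """Back-project each latent token at grid cell (u, v) to a world-space point."""
    cache = []
    for v, row in enumerate(latents):
        for u, feat in enumerate(row):
            d = depth[v][u]
            # Grid cell -> camera-space point: scale the pinhole ray by depth.
            pc = [(u - cam.cx) / cam.fx * d, (v - cam.cy) / cam.fy * d, d]
            # Camera -> world: p_w = R p_c + t.
            pw = [sum(cam.R[i][j] * pc[j] for j in range(3)) + cam.t[i]
                  for i in range(3)]
            cache.append((pw, feat))
    return cache


def warp_to_view(cache, cam, height, width):
    """Splat cached latents into a target view, keeping the nearest point per cell."""
    out = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for pw, feat in cache:
        # World -> camera: p_c = R^T (p_w - t).
        diff = [pw[i] - cam.t[i] for i in range(3)]
        pc = [sum(cam.R[j][i] * diff[j] for j in range(3)) for i in range(3)]
        if pc[2] <= 0:
            continue  # point is behind the target camera
        u = round(pc[0] / pc[2] * cam.fx + cam.cx)
        v = round(pc[1] / pc[2] * cam.fy + cam.cy)
        if 0 <= u < width and 0 <= v < height and pc[2] < zbuf[v][u]:
            zbuf[v][u] = pc[2]
            out[v][u] = feat
    return out
```

Cells of the target grid that no cached point reaches stay `None`. In this sketch they simply mark disoccluded regions; how such holes are filled is left open here.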