Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

Date: 2026-05-29
Citations: 0
Influential: 0

AI Summary
This work addresses the loss of visual fidelity and 3D consistency in long-horizon, action-controlled image-to-video generation. It attributes the drift to two causes: information loss from Latent–RGB cycling, in which generated latents are repeatedly decoded to RGB and re-encoded, and a training–inference gap, in which clean training memory does not match the prediction-corrupted memory seen at inference. The proposed framework, Robust Dreamer, introduces Latent Gaussian Memory, which anchors diffusion latents to Gaussian primitives and retrieves them through latent-space Gaussian splatting. It also introduces Deviation Learning with a Dynamic Deviation Archive: rollout-induced latent deviations are synthesized with a one-step approximation, stored in the archive, and injected into historical memory during training so that the model learns to correct corrupted memory. On ScanNet, DL3DV, and OmniWorldGame, the authors report state-of-the-art long-horizon performance.
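To make the memory-recall idea concrete, here is a minimal sketch of splatting latent vectors, assuming isotropic Gaussians, a pinhole camera, and front-to-back alpha compositing. The names `LatentGaussian` and `splat_latents`, and all parameters, are hypothetical and chosen for illustration; the paper's actual rasterizer is not described in this summary and is likely more elaborate.

```python
import math
from dataclasses import dataclass


@dataclass
class LatentGaussian:
    # Hypothetical primitive: a Gaussian that carries a latent vector instead of RGB.
    mean: tuple     # (x, y, z) in camera coordinates; z > 0 is in front of the camera
    scale: float    # isotropic standard deviation in scene units (a simplification)
    opacity: float  # in [0, 1]
    latent: list    # latent feature vector anchored to this primitive


def splat_latents(gaussians, width, height, focal, channels):
    """Render a height x width x channels latent map by alpha compositing."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    transmittance = [[1.0] * width for _ in range(height)]
    # Skip primitives behind the camera, then sort near-to-far so that
    # closer primitives occlude farther ones.
    visible = [g for g in gaussians if g.mean[2] > 0]
    for g in sorted(visible, key=lambda g: g.mean[2]):
        x, y, z = g.mean
        u = focal * x / z + cx                   # pinhole projection (column)
        v = focal * y / z + cy                   # pinhole projection (row)
        sigma = max(focal * g.scale / z, 1e-6)   # projected footprint in pixels
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                alpha = g.opacity * math.exp(-0.5 * d2 / sigma ** 2)
                weight = transmittance[row][col] * alpha
                for c in range(channels):
                    image[row][col][c] += weight * g.latent[c]
                transmittance[row][col] *= 1.0 - alpha
    return image
```

Because the rendered quantity is the latent vector itself, the result can condition the generator directly, without a decode to RGB and a re-encode. That is the property the summary credits with avoiding Latent–RGB cycling.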
Abstract
Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent–RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training–inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. To address these challenges, we present Robust Dreamer, a memory-augmented framework built around how to design 3D memory and how to use it robustly. First, we introduce Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them via latent-space Gaussian splatting. This provides dense, geometry-aware, view-aligned conditioning while avoiding accumulated degradation from repeated VAE conversion. Second, we propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches internal correction before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.
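The store-and-inject loop behind the deviation archive can be sketched roughly as follows. This is an illustrative assumption rather than the authors' implementation: the class `DeviationArchive`, the function `corrupt_memory`, the bounded per-bucket buffer, and the uniform sampling are all hypothetical. The caller is assumed to supply `predicted_latent` from a one-step estimate of the clean latent instead of a full rollout.

```python
import random
from collections import defaultdict, deque


class DeviationArchive:
    """Hypothetical archive of latent deviations, bucketed by
    (autoregressive stage, denoising step)."""

    def __init__(self, capacity_per_bucket=64, seed=0):
        # Bounded buffers keep the archive "dynamic": old deviations are
        # evicted as newer ones arrive (an assumption for this sketch).
        self._buckets = defaultdict(lambda: deque(maxlen=capacity_per_bucket))
        self._rng = random.Random(seed)

    def add(self, stage, timestep, predicted_latent, clean_latent):
        # Deviation = one-step prediction minus the clean training latent.
        deviation = [p - c for p, c in zip(predicted_latent, clean_latent)]
        self._buckets[(stage, timestep)].append(deviation)

    def sample(self, stage, timestep):
        # .get avoids creating an empty bucket for an unseen key.
        bucket = self._buckets.get((stage, timestep))
        if not bucket:
            return None
        return self._rng.choice(list(bucket))


def corrupt_memory(memory_latent, archive, stage, timestep):
    """Inject a stored deviation into a clean memory latent for training."""
    deviation = archive.sample(stage, timestep)
    if deviation is None:
        return list(memory_latent)  # nothing archived yet: leave memory clean
    return [m + d for m, d in zip(memory_latent, deviation)]


archive = DeviationArchive(capacity_per_bucket=2)
archive.add(stage=1, timestep=500, predicted_latent=[1.5, 2.0], clean_latent=[1.0, 2.5])
corrupted = corrupt_memory([0.0, 1.0], archive, stage=1, timestep=500)
```

Keying the buckets by stage and denoising step follows the abstract's description of how deviations are stored. The intent is that the corruption injected at a given point in training resembles what the model would face at the corresponding point of an inference rollout.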
Problem

Research questions and friction points this paper is trying to address.

action-controlled video generation
3D consistency
autoregressive rollout
latent-RGB cycling
training-inference gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Gaussian Memory
Deviation Learning
Action-Controlled Video Generation
Memory-Augmented Diffusion
3D Consistency