StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion-based world models suffer from poor long-range visual consistency due to reliance on short observation sequences, causing generated scenes to rapidly diverge from historical context. To address this, we propose the first integration of state space models—specifically Mamba—into diffusion-based world modeling, yielding a unified architecture for sequential representation learning and conditional image generation with explicit long-term memory. Our approach preserves generation fidelity while substantially improving temporal coherence. We introduce a customized evaluation protocol to quantitatively assess long-horizon consistency. Experiments in 2D maze and complex 3D environments demonstrate that our method improves visual context coherence by over an order of magnitude for rollouts exceeding 100 steps compared to diffusion-only baselines, effectively mitigating the long-term state forgetting problem inherent in conventional diffusion world models.

Technology Category

Application Category

📝 Abstract
World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on a short sequence of observations causes them to quickly lose track of context. As a result, visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of the state-of-the-art diffusion-based world models comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform on long-context tasks by integrating a sequence representation from a state-space model (Mamba), representing the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory.
Problem

Research questions and friction points this paper is trying to address.

Diffusion world models lack long-term context memory
Visual consistency breaks down after few steps
State-space integration needed for lasting environment state
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates state-space model with diffusion model
Enables long-context tasks with lasting memory
Maintains visual consistency over extended steps
🔎 Similar Papers
No similar papers found.
N
N. Savov
INSAIT, Sofia University "St. Kliment Ohridski"
N
Naser Kazemi
INSAIT, Sofia University "St. Kliment Ohridski"
Deheng Zhang
Deheng Zhang
Doctoral Student, INSAIT
Computer GraphicsComputer Vision
D
D. Paudel
INSAIT, Sofia University "St. Kliment Ohridski"
X
Xi Wang
INSAIT, Sofia University "St. Kliment Ohridski", ETH Zurich, TU Munich
L
L. V. Gool
INSAIT, Sofia University "St. Kliment Ohridski"