Latent Beam Diffusion Models for Decoding Image Sequences

📅 2025-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing image sequence generation methods suffer from visual discontinuity, particularly in non-linear narratives, where consistency must hold across non-adjacent frames. Method: We propose a beam search decoding framework that operates in the diffusion model's latent space and searches dynamically for a sequence of latent representations instead of relying on fixed latent priors. A cross-attention mechanism scores and prunes search paths, prioritizing alignment with both the text prompts and the visual context. Results: In human evaluations, our approach outperforms baseline methods on sequence coherence, visual continuity, and text-image alignment. We present it as a new standard for structured image sequence generation.

📝 Abstract

While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives, a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search's quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.
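The search loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`beam_search_latents`, `coherence_score`), the toy 2-D "latents", and the distance-based score are all assumptions made for illustration. The point it shows is the general shape of beam search over per-frame candidate latents, where each candidate is scored against the whole history of a path (not only the adjacent frame) and only the best `beam_width` paths survive each step.

```python
import numpy as np

def coherence_score(history, latent):
    """Hypothetical stand-in score: negative distance between a candidate
    latent and the mean of ALL earlier latents on the path, so coherence
    is judged beyond the adjacent frame. Higher is better."""
    if not history:
        return 0.0
    return -float(np.linalg.norm(latent - np.mean(history, axis=0)))

def beam_search_latents(candidates, score_fn, beam_width=2):
    """candidates[t] is a list of candidate latents for frame t.
    Returns (best_path, best_score), where best_path holds one chosen
    candidate index per frame."""
    beams = [([], 0.0)]  # (indices chosen so far, cumulative score)
    for step_candidates in candidates:
        expanded = []
        for path, total in beams:
            history = [candidates[t][i] for t, i in enumerate(path)]
            for idx, latent in enumerate(step_candidates):
                expanded.append((path + [idx], total + score_fn(history, latent)))
        # Pruning: keep only the highest-scoring partial sequences.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]

# Toy usage: three frames, two candidate latents per frame.
toy_candidates = [
    [np.array([0.0, 0.0]), np.array([10.0, 10.0])],
    [np.array([0.0, 1.0]), np.array([10.0, 7.0])],
    [np.array([1.0, 0.0]), np.array([9.0, 10.0])],
]
best_path, best_score = beam_search_latents(toy_candidates, coherence_score)
print(best_path, best_score)
```

In an actual diffusion pipeline the candidates would be latent tensors produced from different seeds or noise samples, and the score would come from a learned model instead of a Euclidean distance.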

Problem

Research questions and friction points this paper is trying to address.

Ensuring visual consistency in image sequences

Generating coherent non-linear storytelling sequences

Reducing quadratic complexity in beam search

Innovation

Methods, ideas, or system contributions that make the work stand out.

Beam search strategy for latent space exploration

Dynamic search for optimal latent representations

Cross-attention mechanism for efficient path scoring
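As a rough illustration of how cross-attention could score and prune candidates, the sketch below attends from a candidate latent's tokens to a set of context tokens (standing in for text embeddings concatenated with earlier frames' latents) and uses the mean cosine similarity between each candidate token and its attended context as the score. The source does not specify the scoring function, so everything here is an assumption: the names `cross_attention_score` and `prune`, the absence of learned projection matrices, and the cosine-based score itself.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention_score(candidate, context):
    """candidate: (n, d) tokens of one candidate latent (queries).
    context: (m, d) tokens for text and prior frames (keys and values).
    Returns the mean cosine similarity between each candidate token and
    its attention-weighted context. Hypothetical scoring rule."""
    d = candidate.shape[-1]
    weights = softmax(candidate @ context.T / np.sqrt(d), axis=-1)  # (n, m)
    attended = weights @ context                                    # (n, d)
    numerator = (candidate * attended).sum(axis=-1)
    denominator = (np.linalg.norm(candidate, axis=-1)
                   * np.linalg.norm(attended, axis=-1) + 1e-8)
    return float((numerator / denominator).mean())

def prune(candidates, context, keep):
    """Score every candidate against the context and keep the top `keep`.
    Returns (kept indices, their scores), best first."""
    scores = [cross_attention_score(c, context) for c in candidates]
    order = [int(i) for i in np.argsort(scores)[::-1][:keep]]
    return order, [scores[i] for i in order]

# Toy usage: two context tokens, two single-token candidates.
context = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = [np.array([[-5.0, 0.0]]), np.array([[5.0, 0.0]])]
kept, kept_scores = prune(candidates, context, keep=1)
print(kept, kept_scores)
```

Scoring candidates in one batched attention pass and discarding the low-scoring ones early is what keeps the number of paths that must be fully decoded small.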
