PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

📅 2026-06-15
🤖 AI Summary
Existing memory designs for video generation struggle to maintain long-term semantic and structural consistency after edits, because stored context can become outdated or invalid once the scene's appearance or layout changes. This work proposes a disentangled multi-modal context memory that keeps separate RGB and depth memory banks: the RGB bank models appearance semantics, and the depth bank models geometric structure independently of semantics. An edit-aware memory update and retrieval strategy keeps the stored context aligned with post-edit observations, enabling video generation that stays consistent across time and viewpoints. The reported experiments show that the approach significantly outperforms current state-of-the-art methods at preserving long-range semantic and structural coherence after editing.
📝 Abstract
Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
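The abstract does not give implementation details, so the following is only an illustrative sketch of how two disentangled memory banks with an edit-aware update might be organized. Every name in it (`MemoryEntry`, `ContextMemory`, `DisentangledMemory`, `apply_edit`, the `view_angle` pose summary) is hypothetical and not taken from the paper. It also assumes a simple invalidation policy: an appearance-only edit drops stale RGB entries but keeps depth entries, while a layout edit drops both, since the RGB memory is described as implicitly encoding geometry.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    frame_id: int      # time step at which the observation was stored
    view_angle: float  # camera yaw in degrees (hypothetical pose summary)
    feature: object    # placeholder for an RGB or depth feature


class ContextMemory:
    """One memory bank (RGB or depth) with viewpoint-based retrieval."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def invalidate(self, is_stale):
        # Keep only the entries that the predicate does not mark as stale.
        self.entries = [e for e in self.entries if not is_stale(e)]

    def retrieve(self, view_angle, k=2):
        # Return the k entries whose viewpoint is closest on the circle.
        def angular_distance(entry):
            d = abs(entry.view_angle - view_angle) % 360.0
            return min(d, 360.0 - d)

        return sorted(self.entries, key=angular_distance)[:k]


class DisentangledMemory:
    """Separate RGB (appearance) and depth (geometry) banks."""

    def __init__(self):
        self.rgb = ContextMemory()
        self.depth = ContextMemory()

    def write(self, frame_id, view_angle, rgb_feature, depth_feature):
        self.rgb.add(MemoryEntry(frame_id, view_angle, rgb_feature))
        self.depth.add(MemoryEntry(frame_id, view_angle, depth_feature))

    def apply_edit(self, edit_frame, changes_layout):
        # Assumed policy: every edit makes pre-edit appearance stale;
        # only layout edits make pre-edit geometry stale as well.
        def stale(entry):
            return entry.frame_id < edit_frame

        self.rgb.invalidate(stale)
        if changes_layout:
            self.depth.invalidate(stale)

    def retrieve(self, view_angle, k=2):
        # Mixed-modality reference context for the generator.
        return {
            "rgb": self.rgb.retrieve(view_angle, k),
            "depth": self.depth.retrieve(view_angle, k),
        }


memory = DisentangledMemory()
for frame_id, angle in enumerate([0.0, 90.0, 180.0]):
    memory.write(frame_id, angle, f"rgb{frame_id}", f"depth{frame_id}")

# Appearance-only edit at frame 3, followed by one post-edit observation.
memory.apply_edit(edit_frame=3, changes_layout=False)
memory.write(3, 10.0, "rgb3", "depth3")

refs = memory.retrieve(view_angle=5.0)
print([e.frame_id for e in refs["rgb"]])    # [3]
print([e.frame_id for e in refs["depth"]])  # [0, 3]
```

In this toy run, the appearance edit leaves only the post-edit RGB observation available, while the depth bank still offers pre-edit geometry from a nearby viewpoint. A real system would store learned features and camera poses rather than strings and a single yaw angle, and the paper's actual update and retrieval rules may differ from this policy.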
Problem

Research questions and friction points this paper is trying to address.

consistent video generation
video editing
long-term consistency
context memory
disentangled representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled memory
consistent video generation
edit-aware memory update
multi-modal context memory
long-term coherence