Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing zero-shot video editing methods struggle to preserve the high-level temporal structure of long videos, often resulting in narrative discontinuities and semantic ambiguity after editing. This work addresses this limitation by explicitly modeling and retaining the original video’s temporal coherence under a zero-shot setting. We introduce an adaptive segmentation strategy based on semantic similarity, coupled with an anchor-frame guidance mechanism and a segment-adaptive token merging and alternating fusion scheme. Building upon pretrained diffusion models, our approach identifies key anchor frames through feature similarity analysis, ensuring high intra-segment editing fidelity while enabling smooth inter-segment transitions. Experiments demonstrate that our method significantly improves temporal consistency without sacrificing computational efficiency, achieving state-of-the-art performance in zero-shot video editing.

📝 Abstract

Existing zero-shot video editing methods rely on pre-trained diffusion models and achieve spatial control and basic temporal consistency, but they fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, whereas temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output becomes narratively incoherent and semantically ambiguous, especially for long videos with complex semantic variations. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy that leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction between clips. Extensive experiments demonstrate that our method achieves state-of-the-art results, balancing the preservation of the original temporal structure with computational efficiency and setting a new benchmark for zero-shot video editing fidelity.
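To make the partitioning and anchor selection described in the abstract concrete, here is a minimal sketch of one way they could work. It is not the authors' implementation: the abstract does not specify the feature extractor, the similarity measure, the threshold, or the anchor criterion. The sketch assumes per-frame feature vectors are already available, uses cosine similarity against the running clip centroid with a hypothetical threshold of 0.8, and picks as anchor the frame most similar on average to the rest of its clip. The function names `partition_clips` and `select_anchor` are illustrative.

```python
import numpy as np


def partition_clips(features, threshold=0.8):
    """Split frames into contiguous clips by feature similarity.

    Assumption (not from the paper): a new clip starts when a frame's cosine
    similarity to the current clip's centroid drops below `threshold`.
    """
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    clips = [[0]]
    for t in range(1, len(feats)):
        centroid = feats[clips[-1]].mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        if float(feats[t] @ centroid) < threshold:
            clips.append([t])      # semantic change: open a new clip
        else:
            clips[-1].append(t)    # still similar: extend the current clip
    return clips


def select_anchor(features, clip):
    """Return the index of the most representative frame in a clip.

    Assumption (not from the paper): the anchor is the frame with the highest
    mean cosine similarity to all frames in the same clip.
    """
    feats = np.asarray(features, dtype=float)[clip]
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    similarity = feats @ feats.T
    return clip[int(np.argmax(similarity.mean(axis=1)))]


# Toy example: six 2-D "frame features" forming two semantic groups.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
            [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
clips = partition_clips(features)
anchors = [select_anchor(features, clip) for clip in clips]
```

On the toy input, the first three frames and the last three frames fall into separate clips. In practice the features would come from a pretrained image or diffusion encoder, and the threshold would need tuning per feature space.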

Problem

Research questions and friction points this paper is trying to address.

temporal structure

zero-shot video editing

narrative coherence

semantic flow

temporal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal structure preservation

zero-shot video editing

adaptive clip partitioning