ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video editing methods rely on costly paired video data, which limits their scalability. This work proposes an efficient framework that trains high-quality video editing models from image pairs alone. Treating each image as a single-frame video, it freezes the pretrained 3D spatiotemporal attention modules to decouple spatial from temporal learning; introduces a 2D spatial difference attention module that progressively injects spatial changes via a predict-and-update mechanism; and adds a text-guided dynamic semantic gating strategy that enables adaptive editing without external masks. Trained on only 13K image pairs for five epochs, the model matches the editing fidelity and temporal consistency of large-scale video-trained counterparts at minimal computational cost.
📝 Abstract
Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pretrained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for five epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.
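The design sketched in the abstract (frozen temporal attention, a predict-and-update spatial-difference pathway, and a text-conditioned gate in place of an external mask) can be illustrated with a rough numerical sketch. Everything below is an illustrative assumption about the mechanism, not the paper's actual implementation: the function names, shapes, number of update steps, and the per-channel sigmoid gating are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_update_step(hidden, spatial_diff, text_emb, w_gate, num_steps=3):
    """Hypothetical sketch of predict-and-update spatial-difference injection.

    At each step, predict an edit residual from the remaining spatial
    difference, scale it by a text-conditioned semantic gate, and add it
    to the (otherwise frozen) hidden features.

    Shapes: hidden (H, W, C), spatial_diff (H, W, C),
            text_emb (D,), w_gate (D, C).
    """
    # Text-guided dynamic semantic gating: one sigmoid weight per channel,
    # standing in for an external binary editing mask.
    gate = sigmoid(text_emb @ w_gate)           # (C,)
    for _ in range(num_steps):
        residual = spatial_diff * gate          # gated edit prediction
        hidden = hidden + residual              # update the features
        spatial_diff = spatial_diff - residual  # shrink remaining difference
    return hidden
```

When the gate for a channel is near zero, the hidden features pass through essentially unchanged, mirroring how regions irrelevant to the text prompt would be preserved without any explicit mask.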
Problem

Research questions and friction points this paper addresses.

video editing
paired video data
scalability
spatiotemporal decoupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial difference attention
image-to-video editing
temporal consistency preservation
text-guided semantic gating
decoupled spatiotemporal learning