🤖 AI Summary
Existing video editing methods, when applied to Video Diffusion Transformers (Video DiTs), rely on attention modification or costly fine-tuning, incurring substantial computational overhead. This paper introduces DFVEdit, a zero-shot video editing method that requires neither and that unifies editing and sampling under a continuous flow perspective. It proposes the Conditional Delta Flow Vector (CDFV), a theoretically unbiased estimate of the Delta Flow Vector (DFV), and adds Implicit Cross Attention (ICA) guidance and Embedding Reinforcement (ER) to further improve editing quality. By operating directly on clean latents via flow transformation, DFVEdit avoids both attention modification and parameter updates. On popular Video DiTs such as CogVideoX and Wan2.1, DFVEdit achieves at least 20× inference speedup and 85% memory reduction relative to attention-engineering-based editing methods, while attaining state-of-the-art structural fidelity, spatial-temporal consistency, and editing quality.
📝 Abstract
The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or fine-tuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. Specifically, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV), a theoretically unbiased estimation of the Delta Flow Vector (DFV), and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.
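The abstract gives no formulas, so the following is only a toy sketch of the general idea of editing a clean latent with a difference of two conditional flow predictions. It is not the paper's CDFV, and it does not model ICA or ER. Everything here is an assumption made for illustration: the rectified-flow path `z_t = (1 - t) * z_0 + t * noise`, the hypothetical `toy_velocity` stand-in for a Video DiT, the hypothetical `delta_flow_edit` helper, and the use of plain Euler steps.

```python
import numpy as np


def toy_velocity(z_t, t, mu_cond):
    """Hypothetical stand-in for a Video DiT velocity prediction.

    Assumes the path z_t = (1 - t) * z_0 + t * noise and a "model" whose
    clean prediction under a condition is the single point mu_cond, which
    gives the velocity (z_t - mu_cond) / t.
    """
    return (z_t - mu_cond) / t


def delta_flow_edit(z_src, mu_src, mu_tgt, num_steps=50, seed=0):
    """Edit a clean latent by integrating a difference of velocities.

    Illustrative only: the latent is never inverted to noise and no model
    weights or attention maps are touched; the clean latent itself is moved.
    """
    rng = np.random.default_rng(seed)
    z_edit = z_src.copy()
    # Timesteps from t = 1 (pure noise) down to t = 0 (clean latent).
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Noise both latents with the same sample so that it cancels
        # in the velocity difference.
        noise = rng.standard_normal(z_src.shape)
        z_t_src = (1 - t) * z_src + t * noise
        z_t_edit = (1 - t) * z_edit + t * noise
        # Difference between target-conditioned and source-conditioned flow.
        delta = toy_velocity(z_t_edit, t, mu_tgt) - toy_velocity(z_t_src, t, mu_src)
        # Euler step applied directly to the clean edited latent.
        z_edit = z_edit + (t_next - t) * delta
    return z_edit


# Toy latent with shape (frames, channels, height, width).
shape = (4, 2, 3, 3)
mu_src = np.zeros(shape)   # toy "source prompt" content
mu_tgt = np.ones(shape)    # toy "target prompt" content
detail = 0.1 * np.random.default_rng(1).standard_normal(shape)
z_src = mu_src + detail    # source latent = prompt content + fine detail

z_edit = delta_flow_edit(z_src, mu_src, mu_tgt)
```

In this toy setting the edited latent ends up close to `z_src + (mu_tgt - mu_src)`: the condition-dependent content shifts while the source-specific `detail` is carried over. When the source and target conditions are identical, the delta is zero and the latent is left unchanged.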