🤖 AI Summary
This work addresses zero-shot, text-driven generation and editing of dynamic content in real-world videos: synthesizing new dynamic objects within an input video and making them interact with the scene in a physically consistent way (e.g., occlusion, motion coherence, appearance blending), using only a natural-language instruction. To this end, we propose a training-free strategy that manipulates features inside the attention mechanism to localize and integrate the new content without fine-tuning, preserving the structure of the original video. The method combines a pre-trained text-to-video diffusion transformer, which synthesizes the new content, with a pre-trained vision-language model, which describes the augmented scene in detail, in a single inference-time pipeline. Experiments on real videos with complex camera motion and multi-object interactions show high-fidelity, fully automatic dynamic editing that significantly outperforms existing text-driven video editing methods in both visual quality and physical plausibility.
📝 Abstract
We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
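The abstract does not spell out how features are manipulated inside the attention mechanism. As a rough illustration only, one common training-free pattern is to let the tokens of the edited video attend to their own keys and values plus a selected subset of keys and values taken from the original video. The NumPy sketch below assumes that pattern; the function name `extended_attention`, the `keep_mask` argument, and the single-head, unbatched shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extended_attention(q_edit, k_edit, v_edit, k_orig, v_orig, keep_mask):
    """Single-head attention over edited tokens plus selected original tokens.

    Illustrative assumption, not the authors' implementation.

    q_edit, k_edit, v_edit: (n, d) features of the video being generated.
    k_orig, v_orig:         (m, d) features extracted from the original video.
    keep_mask:              (m,) boolean, True for original tokens to expose
                            (e.g. regions that should stay unchanged).
    Returns the attended output (n, d) and the attention weights (n, n + kept).
    """
    d = q_edit.shape[-1]
    # Keep only the original-video tokens selected by the mask.
    k_anchor = k_orig[keep_mask]
    v_anchor = v_orig[keep_mask]
    # Extend the key/value set; queries still come only from the edited video.
    k = np.concatenate([k_edit, k_anchor], axis=0)
    v = np.concatenate([v_edit, v_anchor], axis=0)
    # Standard scaled dot-product attention over the extended set.
    weights = softmax(q_edit @ k.T / np.sqrt(d), axis=-1)
    return weights @ v, weights
```

With an all-False `keep_mask` the function reduces to ordinary self-attention, so the mask controls how strongly the original footage constrains the generated content. A real diffusion transformer would apply this per head and per layer on batched tensors.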