🤖 AI Summary
Existing video diffusion models struggle to support diverse natural language–driven editing operations, such as object addition, removal, and attribute modification, within a unified framework. To address this, we propose the first instruction-driven unified video editing framework that tightly integrates a multimodal large language model (MLLM) with a video diffusion model, enabling holistic cross-frame semantic understanding, visual grounding, and reasoning segmentation. Our method introduces three key innovations: (1) an instruction-video cross-modal alignment mechanism, (2) a dynamic image-to-video injection strategy for data synthesis, and (3) a progressive curriculum learning paradigm. Furthermore, we construct the first high-quality, multi-task instructional video editing dataset. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods on multi-skill instruction-following editing, while also generalizing well to zero-shot multimodal instruction editing and in-context editing.
📝 Abstract
Recent video diffusion models have advanced video editing, but handling instructional editing and diverse tasks (e.g., adding, removing, changing objects) within a unified framework remains challenging. In this paper, we introduce VEGGIE (Video Editor with Grounded Generation from Instructions), a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and a text query, VEGGIE first uses an MLLM to interpret the user's intent and ground it to the video context, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans, producing edited videos that align with the user's intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: we first align the MLLM and the video diffusion model on large-scale instructional image editing data, then fine-tune end-to-end on high-quality multi-task video data. Additionally, we introduce a novel data synthesis pipeline that generates paired instructional video editing data for model training: it transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance across editing skills in instructional video editing, outperforming the best instructional baseline as a single versatile model, whereas other models struggle with multi-tasking. VEGGIE also excels at video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications such as zero-shot multimodal instructional and in-context video editing.
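The two-stage design described above (MLLM planning, then diffusion rendering) can be sketched in a few lines. This is a hypothetical illustration only: the class and function names (`GroundingMLLM`, `VideoDiffusion`, `edit_video`) and the stubbed behavior are assumptions, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedQuery:
    """A frame-specific grounded task query produced by the MLLM."""
    frame_idx: int
    task_embedding: List[float]  # placeholder for a learned query vector


class GroundingMLLM:
    """Stand-in for the MLLM that interprets the instruction and
    grounds it to each frame of the input video."""

    def plan(self, frames: List[str], instruction: str) -> List[GroundedQuery]:
        # Emit one grounded query per frame; a real model would condition
        # the query embedding on both the instruction and frame content.
        return [GroundedQuery(i, [0.0]) for i, _ in enumerate(frames)]


class VideoDiffusion:
    """Stand-in for the diffusion model that renders the edit plan."""

    def render(self, frames: List[str], queries: List[GroundedQuery]) -> List[str]:
        # A real model would condition denoising on the per-frame queries;
        # here each frame is just tagged as edited.
        return [f"edited({f})" for f in frames]


def edit_video(frames: List[str], instruction: str) -> List[str]:
    mllm, diffusion = GroundingMLLM(), VideoDiffusion()
    queries = mllm.plan(frames, instruction)   # stage 1: intent + grounding
    return diffusion.render(frames, queries)   # stage 2: pixel-space edit


edited = edit_video(["f0", "f1", "f2"], "remove the red car")
print(edited)  # → ['edited(f0)', 'edited(f1)', 'edited(f2)']
```

The key design point is the interface between the stages: the MLLM communicates with the renderer only through per-frame grounded queries, which is what lets a single diffusion backbone serve many editing skills.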