VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video diffusion models struggle to support diverse natural language–driven editing operations—such as object addition, removal, and attribute modification—within a unified framework. To address this, we propose a unified instruction-driven video editing framework that integrates a multimodal large language model (MLLM) with a video diffusion model, enabling cross-frame semantic understanding, visual grounding, and reasoning segmentation. The method rests on three key components: (1) an MLLM that grounds user instructions into frame-specific task queries for the diffusion model, (2) a data synthesis pipeline that uses Image-to-Video models to inject dynamics into static instructional image-editing data, yielding a high-quality multi-task instructional video editing dataset, and (3) a curriculum learning strategy that aligns the MLLM and diffusion model on large-scale instructional image editing data before end-to-end fine-tuning on multi-task video data. Extensive experiments show that the approach outperforms state-of-the-art methods on multi-skill instruction-following editing, while also generalizing to zero-shot multimodal instruction editing and in-context video editing.
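
The two-stage inference flow described above can be pictured, very roughly, as follows. This is a minimal sketch rather than the authors' implementation: GroundedTaskQuery, ground_instruction, and render_edit are hypothetical stand-ins for the MLLM and the video diffusion model, and the real system would emit learned query embeddings and run iterative denoising.

```python
# Minimal sketch (not the authors' code) of the two-stage inference flow the
# summary describes: an MLLM grounds the instruction into frame-specific task
# queries, then a diffusion model renders the edited frames from those queries.
# All names below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GroundedTaskQuery:
    """Frame-specific plan emitted by the MLLM (assumed structure)."""
    frame_index: int
    query_embedding: np.ndarray  # continuous query conditioning the diffusion model


def ground_instruction(frames: List[np.ndarray], instruction: str) -> List[GroundedTaskQuery]:
    """Stand-in for the MLLM: map (video, instruction) -> per-frame grounded queries."""
    rng = np.random.default_rng(0)
    return [
        GroundedTaskQuery(frame_index=i, query_embedding=rng.normal(size=(1, 768)))
        for i in range(len(frames))
    ]


def render_edit(frames: List[np.ndarray], queries: List[GroundedTaskQuery]) -> List[np.ndarray]:
    """Stand-in for the video diffusion model: denoise frames conditioned on queries."""
    # A real implementation would run iterative denoising conditioned on the
    # query embeddings; this placeholder just returns copies of the inputs.
    return [frame.copy() for frame, _ in zip(frames, queries)]


if __name__ == "__main__":
    video = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(8)]
    plans = ground_instruction(video, "remove the red car from the street")
    edited = render_edit(video, plans)
    print(f"edited {len(edited)} frames using {len(plans)} grounded queries")
```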

📝 Abstract
Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.
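
To make the training recipe concrete, here is a small sketch of the two-phase curriculum the abstract describes: alignment on instructional image editing data first, then end-to-end fine-tuning on multi-task video data. The Stage structure, step counts, and train_step are assumptions for illustration, not the paper's actual training code; only the stage ordering comes from the abstract.

```python
# Rough sketch of the two-stage curriculum from the abstract. Stage contents,
# step counts, and `train_step` are hypothetical; only the ordering (image
# alignment first, then end-to-end video fine-tuning) comes from the abstract.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    name: str
    data: List[Tuple[str, object]]  # (instruction, sample) pairs
    steps: int


def train_step(sample: Tuple[str, object]) -> float:
    """Placeholder for one joint MLLM + diffusion optimization step."""
    return 0.0


def run_curriculum(image_edit_data, video_edit_data) -> None:
    stages = [
        # Stage 1: align the MLLM and the video diffusion model on
        # large-scale instructional *image* editing data.
        Stage("image-alignment", image_edit_data, steps=2),
        # Stage 2: end-to-end fine-tuning on high-quality multi-task *video* data.
        Stage("video-finetune", video_edit_data, steps=2),
    ]
    for stage in stages:
        for step, sample in zip(range(stage.steps), stage.data):
            loss = train_step(sample)
            print(f"{stage.name} step {step}: loss={loss:.3f}")


if __name__ == "__main__":
    fake_images = [("add a hat to the dog", None)] * 4
    fake_videos = [("turn the car blue", None)] * 4
    run_curriculum(fake_images, fake_videos)
```
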
Problem

Research questions and friction points this paper is trying to address.

Unified framework for instructional video editing tasks
Handling diverse editing tasks in videos, such as adding, removing, and changing content
Improving video object grounding and reasoning segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for video editing and reasoning
Curriculum learning with instructional image data
Data synthesis pipeline that turns static image data into paired video editing samples (see the sketch below)
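
The abstract only outlines the synthesis idea, so the following is a speculative sketch of one plausible reading: animate an instructional image-editing pair with an off-the-shelf Image-to-Video model so that the static before/after images become a paired video editing sample. animate_with_i2v and the pairing of source and edited clips are assumptions, not the authors' documented procedure.

```python
# Speculative sketch of the data synthesis idea from the abstract: inject
# dynamics into static instructional image-editing data via an I2V model.
# `animate_with_i2v` is a hypothetical stand-in for an Image-to-Video model.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ImageEditPair:
    instruction: str
    source: np.ndarray   # original image, HxWx3
    edited: np.ndarray   # image after the instructed edit


@dataclass
class VideoEditPair:
    instruction: str
    source_clip: List[np.ndarray]
    edited_clip: List[np.ndarray]


def animate_with_i2v(image: np.ndarray, num_frames: int = 8) -> List[np.ndarray]:
    """Placeholder for an Image-to-Video model; here it simply repeats the frame."""
    return [image.copy() for _ in range(num_frames)]


def synthesize(pairs: List[ImageEditPair]) -> List[VideoEditPair]:
    """Turn static image-editing pairs into (assumed) paired video-editing samples."""
    samples = []
    for pair in pairs:
        samples.append(
            VideoEditPair(
                instruction=pair.instruction,
                source_clip=animate_with_i2v(pair.source),
                edited_clip=animate_with_i2v(pair.edited),
            )
        )
    return samples


if __name__ == "__main__":
    img = np.zeros((64, 64, 3), dtype=np.float32)
    data = synthesize([ImageEditPair("remove the lamp", img, img)])
    print(f"synthesized {len(data)} video editing sample(s), "
          f"{len(data[0].source_clip)} frames each")
```
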
🔎 Similar Papers
No similar papers found.