CoVEBench: Can Video Editing Models Handle Complex Instructions?

📅 2026-06-06

🤖 AI Summary

Existing video editing models struggle with real-world compositional instructions that couple several edits across multiple dimensions, and existing benchmarks, limited to isolated edits and coarse global metrics, cannot diagnose this setting. To address this gap, the work proposes CoVEBench, a fine-grained benchmark for complex video editing comprising 416 source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. CoVEBench uses multimodal large language models (MLLMs) to judge instruction compliance and video fidelity, and complements these judgments with automated video quality metrics. Experiments show that current models frequently omit requested edits, violate preservation constraints, or introduce artifacts when handling multiple operations at once, which makes CoVEBench a challenging diagnostic testbed for complex video editing.

📝 Abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.
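
As a rough illustration of how checklist-based scoring of this kind could be aggregated, the sketch below computes per-instruction pass rates separately for requested edits and preservation constraints. The `ChecklistItem` schema, the `edit`/`preserve` labels, the `pass_rates` helper, and the sample verdicts are all hypothetical; the abstract does not specify the benchmark's actual item format or scoring formula. Only the dataset counts (416, 626, 9,990) come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    # Hypothetical schema; the benchmark's real item format is not given here.
    instruction_id: str
    kind: str      # "edit" (requested change) or "preserve" (content to keep)
    passed: bool   # verdict assigned by a judge, e.g. an MLLM


def pass_rates(items):
    """Return {instruction_id: {kind: fraction of items passed}}."""
    counts = {}
    for item in items:
        per_kind = counts.setdefault(item.instruction_id, {})
        passed, total = per_kind.get(item.kind, (0, 0))
        per_kind[item.kind] = (passed + int(item.passed), total + 1)
    return {
        iid: {kind: p / t for kind, (p, t) in per_kind.items()}
        for iid, per_kind in counts.items()
    }


# Made-up verdicts for one instruction: one of two edits was missed,
# while the single preservation constraint was respected.
items = [
    ChecklistItem("v001", "edit", True),
    ChecklistItem("v001", "edit", False),
    ChecklistItem("v001", "preserve", True),
]
rates = pass_rates(items)

# Dataset-level averages derived from the counts reported in the abstract.
items_per_instruction = 9990 / 626    # roughly 16 checklist items per instruction
instructions_per_video = 626 / 416    # roughly 1.5 instructions per source video
```

Keeping edit and preservation items apart means a model that applies every requested edit but damages unrelated content is not scored the same as one that preserves everything but skips edits, which matches the failure modes the abstract distinguishes.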

Problem

Research questions and friction points this paper is trying to address.

video editing

complex instructions

compositional editing

benchmark

spatiotemporal preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional video editing

video editing benchmark

multi-point instructions