🤖 AI Summary
This work addresses the limitations of existing image editing models and evaluation benchmarks, which predominantly rely on textual instructions and struggle to support visual directives—such as sketches—that are integral to human multimodal interaction. To bridge this gap, we introduce VIBE, the first systematic benchmark for vision-instructed image editing, which defines a three-tiered hierarchy of task complexity ranging from referential localization and shape manipulation to causal reasoning. We further develop a fine-grained automatic evaluation framework based on large multimodal models (LMMs) and use it to assess 17 open- and closed-source models across diverse visual instructions. Our evaluation reveals that closed-source models exhibit stronger, though still preliminary, instruction-following capabilities, yet all models suffer significant performance degradation on higher-order tasks, highlighting critical limitations and pointing toward promising directions for future research.
📝 Abstract
Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.