🤖 AI Summary
This work addresses the limitations of existing image editing models and evaluation benchmarks, which predominantly rely on textual instructions and struggle to support visual directives—such as sketches—that are integral to human multimodal interaction. To bridge this gap, we introduce VIBE, the first systematic benchmark for vision-instructed image editing, which defines a three-tiered hierarchy of task complexity ranging from referential localization and shape manipulation to causal reasoning. We further develop a fine-grained automatic evaluation framework based on large multimodal models (LMMs) and use it to assess 17 open- and closed-source models across diverse visual instructions. Our evaluation reveals that closed-source models exhibit stronger, though still preliminary, instruction-following capabilities, yet all models suffer significant performance degradation on higher-order tasks, highlighting critical limitations and pointing toward promising directions for future research.
📝 Abstract
Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.