🤖 AI Summary
Existing image editing benchmarks are largely confined to appearance-level adjustments and struggle to evaluate whether models can jointly satisfy complex instructions and multidimensional constraints (such as geometric, physical, and usability requirements) in professional visual tasks. To address this gap, this work proposes CV-Arena, an open evaluation benchmark for professional-grade instruction-driven image editing, comprising 12K high-resolution real image–instruction pairs. It also introduces Active Elo, a framework for human–AI collaborative preference learning; CV-Judge, a logic-gated multidimensional evaluator; and CV-Agent, a lightweight agentic model with a planning–editing–verification closed-loop architecture. Evaluation across 21 systems reveals significant shortcomings in current approaches regarding instruction following, physical reasoning, and structural control, and suggests that closed-loop design is a promising direction for professional editing scenarios.
📝 Abstract
Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scale. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures, resolve high-confidence comparisons automatically, and route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
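The abstract does not spell out the Active Elo update rule or the routing criteria. As a rough illustration only, the sketch below assumes a standard logistic Elo update in which each vote is scaled by a reliability weight, plus a simple confidence threshold for deciding whether a comparison goes to the automatic judge or to a human rater. All names (`Comparison`, `route`, `update`), the threshold, the K-factor, and the weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    a: str              # system whose output is on the left
    b: str              # system whose output is on the right
    outcome: float      # 1.0 if a wins, 0.0 if b wins, 0.5 for a tie
    reliability: float  # assumed weight in [0, 1] for the vote's source


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard logistic Elo expectation that a beats b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def route(gate_passed: bool, judge_confidence: float,
          threshold: float = 0.8) -> str:
    """Assumed routing rule: reject gate failures, let the automatic judge
    settle confident comparisons, and send the rest to human experts."""
    if not gate_passed:
        return "reject"
    return "judge" if judge_confidence >= threshold else "human"


def update(ratings: dict[str, float], c: Comparison, k: float = 32.0) -> None:
    """Elo update with the step size scaled by the vote's reliability."""
    e_a = expected_score(ratings[c.a], ratings[c.b])
    delta = k * c.reliability * (c.outcome - e_a)
    ratings[c.a] += delta
    ratings[c.b] -= delta  # zero-sum: total rating mass is conserved


ratings = {"model_x": 1000.0, "model_y": 1000.0}
# A human vote (assumed weight 1.0) followed by a judge vote (assumed 0.5).
update(ratings, Comparison("model_x", "model_y", 1.0, reliability=1.0))
update(ratings, Comparison("model_x", "model_y", 0.0, reliability=0.5))
print(ratings)
```

Under these assumptions, a low-reliability vote moves the ratings less than a high-reliability one, which is one plausible way to mix human and AI supervision in a single leaderboard.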