🤖 AI Summary
Existing visual editing models perform well on open-ended tasks but struggle with edits that have a single, precisely defined correct outcome. To study this limitation, this work proposes PaintBench, a deterministic evaluation framework built on procedural generation, which makes it effectively unlimited in size, resistant to data contamination, and independent of bias-prone judge models. PaintBench covers 20 fundamental precise editing operations across four categories (geometric transformation, structural manipulation, color change, and symbolic reasoning) and supports systematic assessment through procedurally generated samples, configurable complexity levels, pixel-level mIoU metrics, and diagnostic task decomposition. Experiments show that even the best-performing of 11 image editing models reaches only 17.1% mIoU, with geometric transformations, most structural manipulations, and formula-based color changes proving particularly challenging; performance also degrades with scene variations in object count, background complexity, color scheme, and edit-region size. PaintBench scores correlate strongly ($R^2 = 0.91$, $p < 0.001$) with performance on TinyGrafixBench, a procedural, deterministic evaluation for data visualization editing, which suggests that the scores generalize to applied tasks.
📝 Abstract
While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
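To make the idea of deterministic pixel-level scoring concrete, here is a minimal sketch of a mean intersection-over-union (mIoU) computation. The abstract does not specify how PaintBench defines its mIoU, so the formulation is an assumption: each image is reduced to a grid of integer labels (for example, color indices), IoU is computed per label, and the result is averaged over every label that appears in either image. The names `Grid` and `mean_iou` are hypothetical and do not come from the paper.

```python
from typing import Sequence

# Assumption: a tiny "image" is a grid holding one integer label per pixel.
Grid = Sequence[Sequence[int]]


def mean_iou(predicted: Grid, target: Grid) -> float:
    """Mean intersection-over-union across all labels present in either grid."""
    if len(predicted) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(predicted, target)
    ):
        raise ValueError("predicted and target must have the same shape")

    intersection = {}  # label -> pixels where both grids carry that label
    union = {}         # label -> pixels where at least one grid carries it
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                intersection[p] = intersection.get(p, 0) + 1
                union[p] = union.get(p, 0) + 1
            else:
                # A mismatched pixel enlarges the union of both labels involved.
                union[p] = union.get(p, 0) + 1
                union[t] = union.get(t, 0) + 1

    if not union:
        raise ValueError("grids must contain at least one pixel")
    return sum(intersection.get(label, 0) / union[label] for label in union) / len(union)


# Toy example: the model output mislabels one pixel of the expected edit.
target = [[0, 0, 1],
          [0, 1, 1]]
predicted = [[0, 0, 1],
             [0, 0, 1]]
score = mean_iou(predicted, target)  # label 0: 3/4, label 1: 2/3
```

Because the score depends only on the two pixel grids, repeated evaluation of the same output always gives the same number, which is the property that lets a benchmark of this kind avoid judge models.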