🤖 AI Summary
This work addresses the limitations of current text-guided image editing methods, which struggle to achieve faithful instruction following, minimal editing, and high visual quality at the same time on complex tasks involving position, motion, viewpoint, scale, or creative transformations. To this end, the authors propose TECCI, a new benchmark designed for challenging editing scenarios, comprising 7,550 curated image-instruction pairs that span seven image categories and five edit types per source image. Most instructions are generated automatically by Gemini, and a further set of 530 images carries challenging manually written instructions. Five leading editing models are evaluated by humans along three dimensions (instruction following, edit minimality, and visual quality), and a Gemini-based auto-rater is built to scale up the evaluation. No model exceeds a 22% overall success rate; performance is particularly poor on architecture and nature images, and reasoning and creative edits prove the most difficult. The auto-rater matches human evaluations with 74.7% accuracy.
📝 Abstract
Despite tremendous recent progress, current text-guided image editing methods still struggle with several aspects of editing: following instructions, editing the source image minimally, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as edits involving position, motion, viewpoint, or scale, and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images that we are releasing. The images in TECCI span 7 image categories. The images and categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceeds a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following than at making minimal edits or preserving visual quality, 4) models struggle with editing architecture and nature images, which require a strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.
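As an illustration of how the reported metrics could be computed, the following is a minimal sketch, not the authors' implementation. It assumes that each judgment is a pass/fail label per dimension, that an edit counts as an overall success only when all three dimensions pass, and that auto-rater accuracy is the fraction of per-dimension labels matching the human labels. The abstract does not specify any of these definitions; the `Rating` class, the function names, and the toy data are hypothetical.

```python
from dataclasses import dataclass

# The three evaluation dimensions named in the abstract.
DIMENSIONS = ("instruction_following", "minimality", "visual_quality")


@dataclass(frozen=True)
class Rating:
    """Hypothetical pass/fail judgment of one edited image."""
    instruction_following: bool
    minimality: bool
    visual_quality: bool

    def overall_success(self) -> bool:
        # Assumption: overall success requires passing all three dimensions.
        return all(getattr(self, d) for d in DIMENSIONS)


def success_rates(ratings):
    """Return the pass rate per dimension plus the overall success rate."""
    n = len(ratings)
    if n == 0:
        raise ValueError("ratings must not be empty")
    rates = {d: sum(getattr(r, d) for r in ratings) / n for d in DIMENSIONS}
    rates["overall"] = sum(r.overall_success() for r in ratings) / n
    return rates


def agreement(human, auto):
    """Fraction of per-dimension labels on which the two raters agree."""
    if not human or len(human) != len(auto):
        raise ValueError("rating lists must be non-empty and equally long")
    matches = sum(
        getattr(h, d) == getattr(a, d)
        for h, a in zip(human, auto)
        for d in DIMENSIONS
    )
    return matches / (len(human) * len(DIMENSIONS))


# Toy data for illustration only; these are not results from the paper.
human = [
    Rating(True, True, True),
    Rating(True, False, True),
    Rating(True, False, False),
    Rating(False, True, True),
]
auto = [
    Rating(True, True, True),
    Rating(True, True, True),
    Rating(True, False, False),
    Rating(False, True, False),
]

rates = success_rates(human)
acc = agreement(human, auto)
print(rates)
print(f"auto-rater agreement: {acc:.3f}")
```

Under the all-dimensions-must-pass assumption, the overall rate can never exceed the lowest per-dimension rate, which is one way a model can follow instructions well and still have a low overall success rate.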