InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of fine-grained, spatially precise image editing for visually inconspicuous targets in complex multi-entity scenes. The authors propose InterCoG, a framework that leverages spatial reasoning from textual instructions to localize editing targets, generates bounding boxes and masks via visual grounding, and rewrites editing commands to clarify intended outcomes. A key innovation is the introduction of a textโ€“vision interleaved chain-of-grounding reasoning mechanism, augmented by multimodal grounding reconstruction supervision and a reasoning alignment module, which together significantly enhance localization accuracy and model interpretability. To support this research, the authors also construct the GroundEdit-45K dataset and the GroundEdit-Bench evaluation benchmark. Experimental results demonstrate that InterCoG substantially outperforms existing methods on fine-grained image editing tasks.


๐Ÿ“ Abstract
Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details, explicitly deducing the location and identity of the editing target. It then conducts visual grounding by highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment, which enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
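The three-stage pipeline described in the abstract can be sketched schematically. This is a minimal toy illustration, not the authors' implementation: the function and class names (`chain_of_grounding`, `GroundedEdit`) and the dictionary-lookup "reasoning" step are hypothetical stand-ins for the paper's learned text-reasoning and grounding modules.

```python
from dataclasses import dataclass

@dataclass
class GroundedEdit:
    target: str                        # stage 1: target deduced via textual spatial reasoning
    bbox: tuple                        # stage 2: bounding box from visual grounding (x1, y1, x2, y2)
    rewritten_instruction: str         # stage 3: editing command rewritten with explicit outcome

def chain_of_grounding(instruction: str, scene_objects: dict) -> GroundedEdit:
    """Toy interleaved chain-of-grounding: text reasoning -> grounding -> rewrite.

    `scene_objects` maps object descriptions to bounding boxes; in the real
    framework, stages 1-2 would be produced by a multimodal model, not a lookup.
    """
    # Stage 1: resolve which entity the spatially-phrased instruction refers to.
    target = next(name for name in scene_objects if name in instruction)
    # Stage 2: visual grounding - localize the target in pixel space.
    bbox = scene_objects[target]
    # Stage 3: rewrite the instruction to state the intended edit explicitly.
    rewritten = f"Apply the edit to '{target}' within region {bbox}."
    return GroundedEdit(target, bbox, rewritten)

# Toy multi-entity scene with two visually similar objects.
scene = {"left mug": (40, 120, 90, 180), "right mug": (200, 118, 250, 182)}
edit = chain_of_grounding("recolor the left mug", scene)
print(edit.rewritten_instruction)
```

The point of the interleaving is that each stage constrains the next: the grounded box from stage 2 disambiguates between the two similar mugs before any pixels are edited.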
Problem

Research questions and friction points this paper is trying to address.

fine-grained image editing
spatial reasoning
multi-entity scenes
visual grounding
spatially precise editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved Chain-of-Grounding
spatially precise image editing
visual grounding
multimodal reasoning
fine-grained editing