Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent

📅 2025-08-28
📈 Citations: 0 (influential citations: 0)
🤖 AI Summary
Semantic image editing faces two key bottlenecks: inversion-based methods suffer from reconstruction artifacts, while instruction-based approaches are constrained by the quality and scale of annotated instruction data. This paper introduces DescriptiveEdit, the first framework to formulate image editing as a reference-image-guided text-to-image generation task, eliminating the need for both inversion and task-specific instruction datasets. Its core innovation is the Cross-Attentive UNet, which fuses reference image features and textual descriptions via cross-attention, without modifying the pre-trained model architecture or inverting the input image. The design natively supports extension modules such as ControlNet and IP-Adapter. On the Emu Edit benchmark, DescriptiveEdit achieves significant improvements in editing accuracy and cross-region consistency, demonstrating strong effectiveness, generalizability, and extensibility on complex semantic editing tasks.

📝 Abstract
Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained text-to-image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.
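The page does not include code, but the attention-bridge idea can be illustrated with a short sketch. In the hypothetical PyTorch module below, the edit branch's hidden states act as queries that attend to reference-image features (keys/values), with a residual connection so the pre-trained pathway is left intact. All names, dimensions, and the module interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionBridge(nn.Module):
    """Hypothetical sketch of an attention bridge (not the paper's code):
    hidden states of the edit branch attend to reference-image features,
    injecting reference content without inversion or architecture changes."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, edit_hidden: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the branch generating the edited image;
        # keys/values come from features of the reference image.
        attended, _ = self.attn(self.norm(edit_hidden), ref_feats, ref_feats)
        # Residual connection keeps the frozen pre-trained pathway intact.
        return edit_hidden + attended

# Toy example: 2 images, 64 spatial tokens, 320 channels (illustrative sizes).
bridge = AttentionBridge(dim=320)
edit_hidden = torch.randn(2, 64, 320)  # UNet hidden states for the edited image
ref_feats = torch.randn(2, 64, 320)    # features extracted from the reference image
print(bridge(edit_hidden, ref_feats).shape)  # torch.Size([2, 64, 320])
```

Because the bridge only adds a residual term, it can in principle sit next to the existing attention layers of a frozen UNet while only the new parameters train.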
Problem

Research questions and friction points this paper is trying to address.

Addresses the reconstruction errors that inversion-based editing methods inevitably introduce
Overcomes the limited quality and scale of annotated instruction-editing datasets
Reframes editing as reference-based text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DescriptiveEdit re-frames editing as reference-image-guided text-to-image generation (see the training sketch after this list)
Cross-Attentive UNet injects reference features via attention bridges
Seamlessly integrates with ControlNet and IP-Adapter extensions
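Since DescriptiveEdit is trained as ordinary text-to-image generation conditioned on a reference image, its training loop would reduce to a standard diffusion denoising objective. The sketch below assumes a model that accepts a `reference=` conditioning argument routed to the attention bridges (a hypothetical interface) and implements the usual DDPM forward process inline; it illustrates the reframing, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def training_step(model, latents, ref_latents, text_emb, alphas_cumprod):
    """Hypothetical DescriptiveEdit-style step: plain text-to-image
    denoising, with the reference image as an extra condition
    (the `reference=` kwarg is an assumed interface)."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.numel(), (latents.size(0),),
                      device=latents.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Standard DDPM forward process: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps.
    noisy = a.sqrt() * latents + (1.0 - a).sqrt() * noise
    pred = model(noisy, t, encoder_hidden_states=text_emb, reference=ref_latents)
    # Usual epsilon-prediction loss; no inversion of the input image anywhere.
    return F.mse_loss(pred, noise)
```

Because nothing here departs from the standard text-to-image interface, extensions such as ControlNet or IP-Adapter can in principle be attached unchanged, which is the extensibility the paper claims.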
👥 Authors

En Ci (Nanjing University, China)
Shanyan Guan (vivo, China)
Yanhao Ge (vivo, China)
Yilin Zhang (Michigan State University)
Wei Li (vivo, China)
Zhenyu Zhang (Nanjing University, China)
Jian Yang (Nanjing University, China)
Ying Tai (Nanjing University, China)