MoEdit: On Learning Quantity Perception for Multi-object Image Editing

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-object image editing methods struggle to achieve fine-grained control over individual object attributes while maintaining global consistency in object count, spatial distribution, and semantics, leading to distortions in quantity, layout, and meaning after editing. To address this, the authors propose MoEdit, a framework that integrates a Feature Compensation (FeCom) module and a Quantity Attention (QTTN) module into the Stable Diffusion architecture. FeCom keeps each object's attributes distinct and separable, while QTTN perceives and preserves object quantity without additional annotations or auxiliary models. MoEdit uniformly supports high-fidelity style transfer, object reinvention, and background regeneration. Evaluated on multi-object editing benchmarks, it achieves state-of-the-art performance, with consistent gains in numerical accuracy, spatial plausibility, and visual fidelity.

📝 Abstract
Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for such applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and as part of the whole image during editing, both of which are crucial for consistent quantity perception, and thus achieve suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which keeps each object's attributes distinct and separable by minimizing the interlacing between them. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency through effective control during editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and code will be available at https://github.com/Tear-kitty/MoEdit.
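
To make the quantity-preservation idea concrete, below is a minimal PyTorch sketch of a quantity-aware attention block in the spirit of QTTN. Everything here, the `QuantityAttention` name, the learned token count, the dimensions, and the residual injection point, is an illustrative assumption rather than the paper's implementation, which is to be released at the repository above.

```python
# Hedged sketch: learned "quantity tokens" cross-attend against flattened UNet
# latents to summarize per-object evidence, then modulate the latents back.
# Shapes and the injection point are assumptions, not MoEdit's actual design.
import torch
import torch.nn as nn

class QuantityAttention(nn.Module):
    """Cross-attends learned quantity tokens with spatial latents so the
    model can read off (and later preserve) a notion of object count."""

    def __init__(self, dim: int = 320, num_tokens: int = 16, heads: int = 8):
        super().__init__()
        # Learned tokens that aggregate quantity-related evidence (assumption).
        self.quantity_tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, H*W, C) flattened UNet feature map.
        b = latents.shape[0]
        tokens = self.quantity_tokens.expand(b, -1, -1)
        # Tokens query the spatial features to pool per-object evidence.
        summary, _ = self.attn(tokens, latents, latents)
        # Latents query the summary, broadcasting it back as a residual.
        modulation, _ = self.attn(latents, summary, summary)
        return latents + self.proj(modulation)

# Usage: wrap a UNet block's features during denoising.
feats = torch.randn(2, 64 * 64, 320)   # e.g. a 64x64 feature map, 320 channels
out = QuantityAttention(dim=320)(feats)
print(out.shape)                       # torch.Size([2, 4096, 320])
```

Because the block is residual and annotation-free, it can in principle be slotted into a frozen SD UNet and trained alone, which matches the abstract's "auxiliary-free" claim; whether MoEdit trains it this way is not stated in this summary.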
Problem

Research questions and friction points this paper is trying to address.

Addresses inconsistent quantity perception in multi-object image editing.
Proposes MoEdit for high-quality editing without auxiliary tools.
Ensures object distinction and quantity consistency in complex edits.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature Compensation (FeCom) module minimizes attribute interlacing between objects (see the sketch after this list)
Quantity Attention (QTTN) module preserves consistent quantity perception
Leverages Stable Diffusion for high-quality multi-object editing
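
The following is a hedged sketch of the feature-compensation idea behind FeCom: reduce "interlacing" by removing from each per-object embedding the component explained by the other objects' embeddings. The function name, the `alpha` weight, and the projection scheme are illustrative readings of the module's stated goal, not the paper's exact formulation.

```python
# Illustrative FeCom-style compensation (assumption, not the paper's math):
# subtract each object embedding's projection onto the other objects.
import torch

def feature_compensation(obj_feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """obj_feats: (N, C), one embedding per object concept in the image."""
    normed = torch.nn.functional.normalize(obj_feats, dim=-1)
    # Pairwise cosine similarity; zero the diagonal so an object
    # does not subtract itself (no gradients flow here in this demo).
    sim = normed @ normed.T
    sim.fill_diagonal_(0.0)
    # Component of each feature explained by the other objects.
    interference = sim @ normed
    return obj_feats - alpha * interference

feats = torch.randn(5, 768)                 # five object embeddings (CLIP-sized)
print(feature_compensation(feats).shape)    # torch.Size([5, 768])
```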
👥 Authors
Yanfeng Li, Macao Polytechnic University
Kahou Chan, Macao Polytechnic University
Yue Sun, Macao Polytechnic University
Chantong Lam, Macao Polytechnic University
Tong Tong, Fuzhou University
Zitong Yu, U.S. Food and Drug Administration (Medical imaging, Deep learning, Machine learning, Image reconstruction)
Keren Fu, Sichuan University, College of Computer Science (computer vision, image processing, machine learning)
Xiaohong Liu, Shanghai Jiao Tong University
Tao Tan, FCA MPU (Medical Imaging AI)