MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-based image editing (IBIE) methods are constrained by small-scale, low-quality datasets, leading to limited editing diversity, high noise levels, and strong biases that hinder robust performance on complex semantic edits. To address this, we propose a novel two-stage multimodal large language model (MLLM)-collaborative construction paradigm: first, an MLLM generates visual-adaptive editing instructions; second, another MLLM synthesizes high-fidelity edited images. This yields a high-quality dataset of over 107K image editing samples covering 18 non-style-transfer editing types and 38 style transfer operations. Fine-tuning open-source models on this dataset achieves significant improvements in complex editing performance on the MultiEdit-Test benchmark while preserving standard editing capabilities. Our results empirically validate the efficacy and generalizability of the proposed data curation paradigm and underscore the critical role of high-quality, semantically rich training data in advancing IBIE.
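The two-stage paradigm is easiest to see as a data-construction loop. The sketch below is a hypothetical illustration only: `instruction_mllm`, `editing_mllm`, and their method names are assumed placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of the two-stage MLLM-collaborative pipeline.
# Model objects and method names are placeholders, not the paper's code.

def build_multiedit_sample(source_image, editing_type,
                           instruction_mllm, editing_mllm):
    """Produce one (source image, instruction, edited image) triple."""
    # Stage 1: an MLLM inspects the source image and writes an
    # editing instruction adapted to its visual content.
    instruction = instruction_mllm.generate_instruction(
        image=source_image, task=editing_type)

    # Stage 2: a second MLLM synthesizes the edited image that
    # carries out the instruction with high fidelity.
    edited_image = editing_mllm.edit(
        image=source_image, instruction=instruction)

    return {"input": source_image,
            "instruction": instruction,
            "output": edited_image}
```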

📝 Abstract
Current instruction-based image editing (IBIE) methods struggle with challenging editing tasks, as both editing types and sample counts of existing datasets are limited. Moreover, traditional dataset construction often contains noisy image-caption pairs, which may introduce biases and limit model capabilities in complex editing scenarios. To address these limitations, we introduce MultiEdit, a comprehensive dataset featuring over 107K high-quality image editing samples. It encompasses 6 challenging editing tasks through a diverse collection of 18 non-style-transfer editing types and 38 style transfer operations, covering a spectrum from sophisticated style transfer to complex semantic operations like person reference editing and in-image text editing. We employ a novel dataset construction pipeline that utilizes two multi-modal large language models (MLLMs) to generate visual-adaptive editing instructions and produce high-fidelity edited images, respectively. Extensive experiments demonstrate that fine-tuning foundational open-source models with our MultiEdit-Train set substantially improves models' performance on sophisticated editing tasks in our proposed MultiEdit-Test benchmark, while effectively preserving their capabilities on the standard editing benchmark. We believe MultiEdit provides a valuable resource for advancing research into more diverse and challenging IBIE capabilities. Our dataset is available at https://huggingface.co/datasets/inclusionAI/MultiEdit.
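Since the dataset is hosted on the Hugging Face Hub, it should load with the standard `datasets` library. A minimal sketch follows; the split and field names are not specified here, so treat them as assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Load MultiEdit by its Hub repo id (taken from the URL above).
ds = load_dataset("inclusionAI/MultiEdit")
print(ds)  # shows the available splits and their sizes

# Field names vary by dataset, so inspect one record before training.
first_split = next(iter(ds))
print(ds[first_split][0])
```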
Problem

Research questions and friction points this paper is trying to address.

Limited editing types and sample counts in existing instruction-based image editing datasets
Noisy image-caption pairs from traditional dataset construction that introduce biases and limit model capabilities
Weak performance on challenging edits such as person reference editing, in-image text editing, and sophisticated style transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage construction pipeline pairing two multi-modal LLMs: one generates visual-adaptive editing instructions, the other produces high-fidelity edited images
Builds MultiEdit, a dataset of over 107K high-quality image editing samples spanning 6 challenging tasks, with a MultiEdit-Test benchmark
Covers 18 non-style-transfer editing types and 38 style transfer operations