🤖 AI Summary
Existing instruction-based image editing (IBIE) methods are constrained by small-scale, low-quality datasets, leading to limited editing diversity, high noise levels, and strong biases, which hinder robust performance on complex semantic edits. To address this, we propose a novel two-stage multimodal large language model (MLLM)-collaborative construction paradigm: first, an MLLM generates visually adaptive editing instructions; second, another MLLM synthesizes high-fidelity edited images. This yields a high-quality dataset of 107K image–instruction pairs covering 18 non-style-transfer and 38 style-transfer editing categories. Fine-tuning open-source models on this dataset achieves significant improvements in complex editing performance on the MultiEdit-Test benchmark while preserving standard task capabilities. Our results empirically validate both the efficacy and generalizability of the proposed data curation paradigm and the critical role of high-quality, semantically rich training data in advancing IBIE.
📝 Abstract
Current instruction-based image editing (IBIE) methods struggle with challenging editing tasks, as both the editing types and sample counts of existing datasets are limited. Moreover, traditional dataset construction often yields noisy image-caption pairs, which may introduce biases and limit model capabilities in complex editing scenarios. To address these limitations, we introduce MultiEdit, a comprehensive dataset featuring over 107K high-quality image editing samples. It encompasses 6 challenging editing tasks through a diverse collection of 18 non-style-transfer editing types and 38 style-transfer operations, covering a spectrum from sophisticated style transfer to complex semantic operations like person reference editing and in-image text editing. We employ a novel dataset construction pipeline that utilizes two multimodal large language models (MLLMs) to generate visual-adaptive editing instructions and produce high-fidelity edited images, respectively. Extensive experiments demonstrate that fine-tuning foundational open-source models with our MultiEdit-Train set substantially improves their performance on sophisticated editing tasks in our proposed MultiEdit-Test benchmark, while effectively preserving their capabilities on the standard editing benchmark. We believe MultiEdit provides a valuable resource for advancing research into more diverse and challenging IBIE capabilities. Our dataset is available at https://huggingface.co/datasets/inclusionAI/MultiEdit.
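The two-stage pipeline described above can be sketched as follows. This is a minimal illustration only: the function names, the `EditSample` record, and the toy stand-in MLLMs are hypothetical and are not part of the released pipeline; the two stages (instruction generation grounded in the source image, then image synthesis conditioned on that instruction) follow the abstract.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditSample:
    source_image: str   # identifier/path of the source image
    instruction: str    # visual-adaptive editing instruction (stage 1 output)
    edited_image: str   # identifier of the synthesized edited image (stage 2 output)

def build_sample(
    source_image: str,
    edit_type: str,
    instruct_mllm: Callable[[str, str], str],  # stage 1: image + edit type -> instruction
    edit_mllm: Callable[[str, str], str],      # stage 2: image + instruction -> edited image
) -> EditSample:
    """Two-stage construction: one MLLM writes an editing instruction
    adapted to the source image, a second MLLM produces the edited image."""
    instruction = instruct_mllm(source_image, edit_type)       # stage 1
    edited_image = edit_mllm(source_image, instruction)        # stage 2
    return EditSample(source_image, instruction, edited_image)

# Toy stand-ins for the two MLLMs, for illustration only.
def toy_instruct_mllm(image: str, edit_type: str) -> str:
    return f"Apply {edit_type} to the main subject of {image}"

def toy_edit_mllm(image: str, instruction: str) -> str:
    return f"{image}::edited"

sample = build_sample("cat.png", "watercolor style transfer",
                      toy_instruct_mllm, toy_edit_mllm)
print(sample.instruction)
```

In the actual dataset, the same loop would run over all 56 editing categories (18 non-style-transfer and 38 style-transfer) to accumulate the 107K samples.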