🤖 AI Summary
This work presents a systematic survey of denoising diffusion-based image editing, with special attention to inpainting, outpainting, and text-guided editing. To address the lack of standardized evaluation, it introduces EditEval, a comprehensive benchmark for text-guided image editing, together with LMM Score, a novel evaluation metric that leverages large multimodal models. It further provides a unified taxonomy and an empirical comparison of multimodal conditional editing methods against traditional context-driven approaches. Additionally, it releases Awesome-Diffusion-Model-Based-Image-Editing-Methods, an open-source repository curating state-of-the-art techniques. The study maps the technical landscape across theoretical foundations, methodological frameworks, and evaluation standards; identifies key limitations in scalability, controllability, and evaluation consistency; and outlines concrete directions for future research. It thus bridges gaps in both methodology and assessment, advancing the rigor and reproducibility of diffusion-based image editing.
📝 Abstract
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. Their core idea is to learn to reverse the process of gradually adding noise to images, which allows them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods that use diffusion models for image editing, covering both theoretical and practical aspects of the field. We present a thorough analysis and categorization of these works from multiple perspectives, including learning strategies, user-input conditions, and the array of specific editing tasks that can be accomplished. In addition, we pay special attention to image inpainting and outpainting, exploring both earlier context-driven methods and current multimodal conditional ones, and offer a comprehensive analysis of their methodologies. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval, featuring an innovative metric, LMM Score. Finally, we discuss current limitations and envision promising directions for future research. The accompanying repository is released at https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.
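The "reverse a gradual noising process" idea from the abstract can be made concrete with a minimal numerical sketch. This toy example (not drawn from the survey itself; the schedule values and single-pixel setup are illustrative assumptions) shows the closed-form forward step `x_t = sqrt(abar_t)*x_0 + sqrt(1 - abar_t)*eps` used in DDPM-style models, and inverts it with an oracle that knows the true noise, standing in for the trained noise-prediction network:

```python
import math
import random

def forward_noise(x0, abar_t, eps):
    """Forward diffusion in closed form: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    return math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

def invert_with_oracle(xt, abar_t, eps):
    """Recover x0 given the true noise (a stand-in for a trained noise predictor)."""
    return (xt - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

# Toy variance schedule: abar_t = prod_i (1 - beta_i), with betas rising linearly.
betas = [1e-4 + i * (0.02 - 1e-4) / 99 for i in range(100)]
abar = 1.0
for b in betas[:50]:        # cumulative product up to timestep t = 50
    abar *= 1.0 - b

random.seed(0)
x0 = 0.7                      # one toy "pixel" value
eps = random.gauss(0.0, 1.0)  # the Gaussian noise actually added
xt = forward_noise(x0, abar, eps)
x0_hat = invert_with_oracle(xt, abar, eps)
print(round(x0_hat, 6))       # recovers the original pixel value
```

In a real diffusion model, `eps` is unknown at sampling time and is estimated by a neural network conditioned on `x_t` and `t`; the editing methods the survey covers differ mainly in how they condition or constrain that reverse process.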