🤖 AI Summary
Existing object removal methods suffer from incomplete removal, incorrect background synthesis, and blurry results, which keeps success rates low. This work attributes these failures to scarce paired training data and to self-supervised training, which leaves the model unsure whether to re-synthesize the masked object or restore the background. It proposes a human-in-the-loop semi-supervised framework with three parts: (i) an iterative data curation process in which human feedback selects high-quality removal pairs and trains a discriminator that automates later rounds; (ii) a paired dataset of over 200,000 object-removal samples; and (iii) RORem, a Stable Diffusion model fine-tuned on that dataset. Experiments show that RORem improves the removal success rate over previous methods by more than 18%, while producing sharper synthesized regions and more consistent backgrounds.
📝 Abstract
Despite significant advances, existing object removal methods struggle with incomplete removal, incorrect content synthesis, and blurry synthesized regions, resulting in low success rates. These issues stem mainly from two causes: the lack of high-quality paired training data, and the self-supervised training paradigm these methods adopt, which forces the model to in-paint the masked regions and so creates ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a human-in-the-loop semi-supervised learning strategy for creating high-quality paired training data, with the aim of training a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets and train an initial object removal model to generate removal samples. We then use human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate subsequent training data generation. By iterating this process for several rounds, we obtain an object removal dataset with over 200K pairs. Fine-tuning the pre-trained Stable Diffusion model on this dataset yields RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. In particular, RORem improves the object removal success rate over previous methods by more than 18%. The dataset, source code, and trained model are available at https://github.com/leeruibin/RORem.
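The curation loop described in the abstract can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the authors' implementation: every name here (`Pair`, `generate_candidates`, `human_label`, `fit_discriminator`, `curate`) is hypothetical, image pairs are reduced to a single made-up quality score, the "discriminator" is a plain threshold rather than a trained network, and the per-round improvement in model skill is assumed rather than measured.

```python
import random
from dataclasses import dataclass


@dataclass
class Pair:
    """Toy stand-in for one (input image, mask, edited image) training pair."""
    sample_id: int
    quality: float  # hypothetical scalar in [0, 1]; higher means a cleaner removal


def generate_candidates(model_skill, n, rng, start_id):
    """Stand-in for running the current removal model on n new masked images."""
    return [
        Pair(start_id + i, min(1.0, max(0.0, rng.gauss(model_skill, 0.2))))
        for i in range(n)
    ]


def human_label(pair, bar=0.6):
    """Stand-in for an annotator marking a removal as successful or not."""
    return pair.quality >= bar


def fit_discriminator(labeled, fallback=0.6):
    """Toy discriminator: a quality threshold between rejected and accepted samples."""
    accepted = [p.quality for p, ok in labeled if ok]
    rejected = [p.quality for p, ok in labeled if not ok]
    if not accepted or not rejected:
        return fallback
    return (min(accepted) + max(rejected)) / 2.0


def curate(rounds=3, per_round=1000, human_budget=100, seed=0):
    """Run several generate -> human review -> discriminator filter rounds."""
    rng = random.Random(seed)
    dataset, model_skill, next_id = [], 0.5, 0
    for _ in range(rounds):
        candidates = generate_candidates(model_skill, per_round, rng, next_id)
        next_id += per_round
        # Humans review only a small slice of each round's output.
        labeled = [(p, human_label(p)) for p in candidates[:human_budget]]
        threshold = fit_discriminator(labeled)
        # The discriminator filters the remaining candidates automatically.
        auto_kept = [p for p in candidates[human_budget:] if p.quality >= threshold]
        dataset.extend([p for p, ok in labeled if ok] + auto_kept)
        # Assumption: retraining on the larger dataset improves the model a little.
        model_skill = min(0.9, model_skill + 0.1)
    return dataset
```

Calling `curate()` returns the accumulated list of accepted pairs. In the pipeline the abstract describes, the corresponding collection would be the paired dataset used to fine-tune the removal model, and the human review budget per round is what the discriminator is meant to keep small.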