🤖 AI Summary
To address low erasure accuracy and incomplete background restoration in complex multi-identity-person (multi-IP) scenarios, such as human-human occlusion, human-object entanglement, and camouflaged backgrounds, this paper proposes Multi-Layer Diffusion (MILD), a diffusion framework that decomposes generation into semantically separated pathways for each instance and the background. The work makes three main contributions: (1) a high-quality multi-IP human erasing dataset with diverse pose variations, complex backgrounds, and varied interactions; (2) Human Morphology Guidance, which integrates pose, parsing, and spatial relations to improve human-centric understanding; and (3) Spatially-Modulated Attention, which guides attention flow so that foreground instances can be disentangled from the background. The authors report that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.
📝 Abstract
Recent years have witnessed the success of diffusion models in image customization tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle in complex multi-IP scenarios involving human-human occlusion, human-object entanglement, and background interference. These challenges stem mainly from two factors: 1) dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; and 2) a lack of spatial decoupling, where foreground instances cannot be effectively disentangled, which limits clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, which integrates pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.
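The abstract does not spell out how Spatially-Modulated Attention is computed, so the following is only a generic illustration of the underlying idea: biasing attention logits with a spatial mask so that queries attend more to keys inside a chosen region. The function name `mask_modulated_attention`, the `key_mask` input, and the `strength` parameter are all assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mask_modulated_attention(q, k, v, key_mask, strength=4.0):
    """Scaled dot-product attention with an additive spatial bias.

    Hypothetical illustration, not the paper's formulation.
    q: (Nq, d) queries, k: (Nk, d) keys, v: (Nk, dv) values.
    key_mask: (Nk,) values in [0, 1]; 1 marks keys inside the region of interest.
    strength: how strongly the mask shifts the attention logits (0 disables it).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)           # (Nq, Nk) similarity scores
    bias = strength * (key_mask - 0.5)      # raise in-region keys, lower the rest
    weights = softmax(logits + bias[None, :], axis=-1)
    return weights @ v, weights
```

With `strength=0.0` this reduces to ordinary scaled dot-product attention; increasing `strength` shifts attention mass toward the masked region. In a per-instance pathway, one could imagine passing a different `key_mask` for each person and for the background, but how MILD actually routes attention is not specified in the abstract.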