🤖 AI Summary
This work addresses the Inexact Segmentation (IS) task by proposing a training-free, plug-and-play segmentation refinement method. The core idea leverages the generative prior encoded in a pre-trained Stable Diffusion model: by contrasting the original image with its mask-conditioned diffusion reconstruction at the pixel level, dense semantic correspondences are established; foreground probability maps are then iteratively updated to achieve coarse-to-fine segmentation refinement. To our knowledge, this is the first approach to explicitly model the generation discrepancies induced by mask conditioning in diffusion models as discriminative supervision signals for segmentation, thereby bridging generative and discriminative vision tasks. Evaluated on multiple IS benchmarks, the method consistently outperforms state-of-the-art discriminative approaches. Both quantitative metrics and qualitative analysis demonstrate superior robustness and enhanced detail recovery, particularly under ambiguous boundaries and low-quality initial masks.
📝 Abstract
This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigms or on dense visual representations derived from internal attention mechanisms, our method exploits the intrinsic generative priors in Stable Diffusion (SD). Specifically, we leverage the pattern discrepancies between original images and mask-conditioned generated images to drive coarse-to-fine segmentation refinement, establishing semantic correspondence alignment and iteratively updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.
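The refinement loop described above can be sketched in a toy form. This is a minimal NumPy sketch, not the authors' implementation: `mask_conditioned_reconstruction` is a hypothetical stand-in for a Stable Diffusion mask-conditioned generation pass, and the discrepancy-to-probability mapping (an exponential of the pixel-wise reconstruction error) is an illustrative assumption rather than the paper's exact update rule.

```python
import numpy as np

def mask_conditioned_reconstruction(image, mask):
    # Hypothetical stand-in for a mask-conditioned Stable Diffusion pass:
    # foreground-labeled pixels are reproduced faithfully, while
    # background-labeled pixels drift from the original image.
    noise = np.random.default_rng(0).normal(0.0, 0.3, image.shape)
    return image * mask[..., None] + (image + noise) * (1.0 - mask[..., None])

def refine_mask(image, prob, iters=5, tau=0.1, lr=0.5):
    """Iteratively sharpen a soft foreground probability map by
    contrasting the image with its mask-conditioned reconstruction."""
    for _ in range(iters):
        hard = (prob > 0.5).astype(float)           # binarize current estimate
        recon = mask_conditioned_reconstruction(image, hard)
        # Pixel-level discrepancy: small where generation matches the input.
        err = np.linalg.norm(image - recon, axis=-1)
        # Low discrepancy under this conditioning -> evidence for foreground
        # (illustrative mapping; the paper's actual update may differ).
        evidence = np.exp(-err / tau)
        prob = np.clip((1.0 - lr) * prob + lr * evidence, 0.0, 1.0)
    return prob

# Toy example: a bright square on a dark background with a coarse initial mask.
image = np.zeros((32, 32, 3))
image[8:24, 8:24] = 1.0
prob0 = np.full((32, 32), 0.3)
prob0[8:24, 8:24] = 0.7
refined = refine_mask(image, prob0)
```

In this toy setting the loop pushes probabilities inside the initially foreground-labeled region toward 1 and those outside toward 0, mirroring the coarse-to-fine behavior described in the abstract.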