🤖 AI Summary
To address the performance limits that sparse image-level supervision and limited training-data diversity impose on weakly supervised semantic segmentation (WSSS), this paper proposes a trainable image augmentation agent that combines large language models (LLMs) with diffusion models. The method couples LLMs and diffusion models to generate semantically consistent supplementary images, uses a prompt self-refinement mechanism to stabilize LLM outputs, and adds an online filtering module that dynamically maintains generation quality and class balance. Evaluated on PASCAL VOC 2012 and MS COCO 2014, the approach outperforms state-of-the-art WSSS methods, with reported absolute mIoU gains of 3.2–4.7 percentage points. These results support the effectiveness and generalizability of the data-generation-driven augmentation strategy.
📝 Abstract
Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse training images provide WSSS with richer information and help the model learn more comprehensive semantic patterns. In this paper, we therefore introduce a novel approach called Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from a data generation perspective. IAA is built around an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability of prompt generation by LLMs, we develop a prompt self-refinement mechanism that allows LLMs to re-evaluate the rationality of generated prompts and produce more coherent ones. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of the generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.
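
The pipeline described in the abstract (LLM prompt generation, prompt self-refinement, diffusion generation, online filtering) might be organized roughly as in the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: every function name, parameter, and threshold here (`refine_prompt`, `augment`, `per_class`, `min_score`, and the `demo_*` stand-ins for the LLM, the diffusion model, and the quality scorer) is hypothetical. The sketch also simplifies the online filter to a check applied after each image is generated, with a per-class quota standing in for class balance.

```python
from collections import Counter


def refine_prompt(draft, critique, revise, max_rounds=3):
    """Let a (hypothetical) LLM critic re-evaluate a prompt and revise it until accepted."""
    prompt = draft
    for _ in range(max_rounds):
        accepted, feedback = critique(prompt)
        if accepted:
            break
        prompt = revise(prompt, feedback)
    return prompt


def augment(class_names, draft_prompt, critique, revise, generate, score,
            per_class=2, min_score=0.5, max_attempts=20):
    """Generate up to `per_class` accepted images for each class label.

    Assumed callables: `draft_prompt`, `critique`, and `revise` wrap an LLM,
    `generate` wraps a diffusion model, and `score` rates image quality.
    """
    kept = []
    counts = Counter()
    for cls in class_names:
        attempts = 0
        # The per-class quota keeps the generated set balanced;
        # max_attempts bounds the loop if the filter keeps rejecting.
        while counts[cls] < per_class and attempts < max_attempts:
            attempts += 1
            prompt = refine_prompt(draft_prompt(cls), critique, revise)
            image = generate(prompt)
            # Simplified online filter: drop low-scoring images immediately.
            if score(image, cls) >= min_score:
                kept.append((cls, prompt, image))
                counts[cls] += 1
    return kept, counts


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def demo_draft(cls):
    return f"a photo of a {cls}"


def demo_critique(prompt):
    if "street" in prompt:
        return True, ""
    return False, "add a plausible scene"


def demo_revise(prompt, feedback):
    return prompt + " on a street"


def demo_generate(prompt):
    return {"prompt": prompt}  # placeholder for an image


def demo_score(image, cls):
    return 1.0 if cls in image["prompt"] else 0.0


if __name__ == "__main__":
    kept, counts = augment(["dog", "car"], demo_draft, demo_critique,
                           demo_revise, demo_generate, demo_score)
    print(dict(counts))
```

In a real setting, the `demo_*` functions would be replaced by calls to an actual LLM, a diffusion model, and a quality or consistency scorer; the abstract does not specify which ones, so they are left abstract here.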