Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weakly supervised semantic segmentation (WSSS) suffers from low-quality pseudo-labels and reliance on external modules (e.g., Class Activation Mapping). To address these challenges, the authors propose an end-to-end Vision Transformer (ViT)-based WSSS framework that directly leverages the model's self-attention maps to generate class-sensitive pseudo-segmentation masks. The method introduces multiple class-specific [CLS] tokens and a random masking strategy to enable fine-grained, class-aware localization; it further uses sparse training to enhance the interpretability and discriminability of the attention maps. Crucially, no auxiliary modules or post-processing steps are required. Evaluated on PASCAL VOC and COCO, the approach produces high-fidelity pseudo-labels, enabling segmentation models trained on them to achieve performance on par with fully supervised baselines. This significantly reduces dependence on expensive pixel-level annotations while maintaining strong generalization across benchmarks.

📝 Abstract
Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token-to-class assignment. At inference time, we aggregate the self-attention maps of the [CLS] tokens corresponding to the predicted labels to generate pseudo segmentation masks. Our approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. These pseudo-masks can be used to train a segmentation model that achieves results comparable to fully supervised models, significantly reducing the need for fine-grained labeled data.
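The random masking strategy the abstract mentions can be illustrated with a small sketch. The paper does not spell out the exact procedure, so the function below is a plausible reading, not the authors' implementation: for each training image, the [CLS] tokens of classes absent from the image-level labels are always masked out, and the tokens of present classes are each dropped with some probability, which pushes each surviving token to account for its own class. The name `random_cls_mask` and the `keep_prob` parameter are hypothetical.

```python
import numpy as np

def random_cls_mask(image_labels, keep_prob=0.5, rng=None):
    """Build a boolean mask over the C class-specific [CLS] tokens.

    image_labels: multi-hot vector of length C (1 = class present).
    A token is kept only if its class is present AND it survives a
    Bernoulli(keep_prob) draw; absent-class tokens are always masked.
    """
    rng = rng if rng is not None else np.random.default_rng()
    labels = np.asarray(image_labels, dtype=bool)      # (C,)
    keep = rng.random(labels.shape) < keep_prob        # random drop
    mask = labels & keep
    # Guarantee at least one present-class token survives, otherwise
    # this image would contribute no classification signal.
    if labels.any() and not mask.any():
        present = np.flatnonzero(labels)
        mask[rng.choice(present)] = True
    return mask
```

With `keep_prob=1.0` the mask simply reproduces the image-level labels; lower values randomly thin the present-class tokens, which is the mechanism assumed to encourage a one-to-one token-to-class assignment.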
Problem

Research questions and friction points this paper is trying to address.

Enhance interpretability of self-attention maps for WSSS
Generate accurate pseudo segmentation masks using ViT
Reduce dependency on fine-grained labeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision Transformer attention maps
Trains sparse ViT with multiple CLS tokens
Aggregates self-attention maps for pseudo-masks
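The aggregation step in the last bullet can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the paper's exact procedure: each predicted class contributes its [CLS]-token attention map over the patch grid, the maps are min-max normalized for comparability, and each location takes the class with the strongest response, falling back to background below a threshold. The function name `aggregate_pseudo_mask` and the `bg_thresh` parameter are hypothetical.

```python
import numpy as np

def aggregate_pseudo_mask(attn_maps, predicted, bg_thresh=0.4):
    """Combine per-class [CLS] attention maps into a pseudo-label map.

    attn_maps: (C, H, W) self-attention of each class token over patches.
    predicted: indices of classes predicted present in the image.
    Returns an (H, W) integer map: 0 = background, c + 1 = class c.
    """
    C, H, W = attn_maps.shape
    predicted = list(predicted)
    if not predicted:
        return np.zeros((H, W), dtype=np.int64)
    maps = np.asarray(attn_maps, dtype=np.float64)[predicted]  # (K, H, W)
    # Min-max normalize each class map so magnitudes are comparable.
    mn = maps.min(axis=(1, 2), keepdims=True)
    mx = maps.max(axis=(1, 2), keepdims=True)
    maps = (maps - mn) / np.maximum(mx - mn, 1e-8)
    best = maps.argmax(axis=0)                 # winning predicted class
    score = maps.max(axis=0)                   # its normalized response
    out = np.array(predicted, dtype=np.int64)[best] + 1
    out[score < bg_thresh] = 0                 # weak response -> background
    return out
```

Restricting the aggregation to the *predicted* classes is what makes the masks class-sensitive: attention maps of absent classes never compete for pixels.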