LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

📅 2026-06-10
🤖 AI Summary
This work addresses open-vocabulary semantic segmentation of scene sketches, where the absence of pixel-level annotations and the lack of texture and color in sketches hinder accurate semantic understanding. To tackle these issues, the authors propose a structure-aware weakly supervised framework that leverages, for the first time, the complementary nature of multi-layer attention maps in Vision Transformers. By accumulating attention maps across layers, the method builds a robust structural prior that guides hierarchical semantic alignment during training and refines predictions during inference. Experiments on the FS-COCO, SFSD, and FrISS datasets show mIoU improvements of 3.43, 8.01, and 15.74, respectively, over prior weakly supervised baselines, with gains in both segmentation accuracy and spatial consistency.
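The gains above are reported in mIoU (mean Intersection over Union). As a reminder of what that metric measures, here is a minimal sketch of a per-class IoU average over integer label maps. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the paper's exact evaluation protocol (for example, whether only stroke pixels are scored) is not stated in this summary.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Average IoU over classes present in `pred` or `target`.

    pred, target: integer label maps of identical shape.
    Classes absent from both maps are skipped (an assumption of this sketch).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class c appears in neither map
        intersection = np.logical_and(p, t).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(mean_iou(pred, target, num_classes=2))
```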
📝 Abstract
Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74 over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
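To make the idea of cross-layer aggregation concrete, the following is a rough sketch of how attention maps from several Vision Transformer layers could be accumulated into one structural prior and then used to smooth patch-level class scores. It is not the authors' implementation: the abstract does not specify the aggregation rule, so a head-averaged, weighted mean across layers is assumed here, and the names `accumulate_attention` and `refine_scores` are hypothetical.

```python
import numpy as np


def accumulate_attention(attn_per_layer, weights=None):
    """Combine per-layer attention into one (tokens, tokens) map.

    attn_per_layer: list of arrays shaped (heads, tokens, tokens),
        each row a probability distribution over tokens.
    weights: optional per-layer weights; uniform if omitted.
    Assumption: heads are averaged and layers are combined by a
    weighted mean. The paper may use a different rule.
    """
    per_layer = np.stack([a.mean(axis=0) for a in attn_per_layer])
    if weights is None:
        weights = np.ones(len(attn_per_layer))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    acc = np.tensordot(weights, per_layer, axes=1)  # (tokens, tokens)
    # Re-normalise rows so each token's attention sums to 1.
    return acc / acc.sum(axis=-1, keepdims=True)


def refine_scores(acc_attn, patch_scores):
    """Propagate patch-level class scores along the accumulated attention.

    patch_scores: (tokens, classes) array of per-patch class scores.
    Hypothetical refinement step: each patch takes the attention-weighted
    average of the scores of the patches it attends to.
    """
    return acc_attn @ patch_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers, heads, tokens, classes = 4, 2, 6, 3
    raw = rng.random((layers, heads, tokens, tokens))
    attn = raw / raw.sum(axis=-1, keepdims=True)  # row-normalised attention
    prior = accumulate_attention(list(attn))
    scores = rng.random((tokens, classes))
    labels = refine_scores(prior, scores).argmax(axis=-1)
    print(prior.shape, labels.shape)
```

Non-uniform `weights` would let such a sketch emphasise shallow layers (global layout) or deep layers (local stroke detail), which is one simple way to probe the complementarity the abstract describes.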
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary
scene sketch
semantic segmentation
weak supervision
line drawings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weak Supervision
Open-Vocabulary Segmentation
Sketch Semantic Segmentation
Vision Transformer
Cross-Layer Attention