Zero-shot segmentation of skin tumors in whole-slide images with vision-language foundation models

📅 2025-11-24
🤖 AI Summary
Fine-grained semantic segmentation of skin tumor whole-slide images (WSIs) faces significant challenges, including substantial morphological variability, ambiguous benign–malignant boundaries, and computational intractability at gigapixel scales. Existing vision-language models (VLMs) are largely restricted to slide-level classification or rely on coarse-grained interactive prompts, lacking pixel-level interpretability. This paper introduces ZEUS, a zero-shot WSI semantic segmentation framework that pairs a frozen VLM encoder with class-specific textual prompt ensembles. ZEUS partitions each WSI into overlapping patches, extracts patch-level image embeddings, scores them by cosine similarity against text embeddings, and aggregates the patch scores into a high-resolution tumor mask, all without pixel-level annotations. Evaluated on two in-house datasets (primary spindle cell neoplasms and cutaneous metastases), ZEUS achieves competitive segmentation performance while reducing annotation burden. The authors present its outputs as scalable, explainable tumor delineations for downstream diagnostic workflows.

📝 Abstract
Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce ZEUS, a fully automated zero-shot visual-language segmentation pipeline for whole-slide images that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. ZEUS partitions each WSI into overlapping patches, extracts a visual embedding for each patch, computes cosine similarities between the patch embeddings and the text prompt embeddings, and combines the resulting scores into a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
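
The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes that each class's prompt ensemble is collapsed into a single prototype by averaging the normalized prompt embeddings, a common convention for CLIP-style models that the abstract does not confirm for ZEUS. The function names (`l2_normalize`, `class_prototypes`, `classify_patches`) and the toy 3-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def class_prototypes(prompt_embeddings):
    """Collapse each class's prompt ensemble into one unit-length prototype.

    prompt_embeddings: list with one (n_prompts, dim) array per class.
    Averaging the ensemble is an assumption, not a detail from the paper.
    """
    protos = [l2_normalize(l2_normalize(p).mean(axis=0)) for p in prompt_embeddings]
    return np.stack(protos)  # shape: (n_classes, dim)

def classify_patches(patch_embeddings, prototypes):
    """Return (labels, similarities) for patch embeddings of shape (n_patches, dim)."""
    sims = l2_normalize(patch_embeddings) @ prototypes.T  # cosine similarity matrix
    return sims.argmax(axis=1), sims

# Toy stand-ins for text-encoder outputs (hypothetical values).
tumor_prompts = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
benign_prompts = np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
prototypes = class_prototypes([tumor_prompts, benign_prompts])

# Toy stand-ins for image-encoder outputs of two patches (hypothetical values).
patches = np.array([[2.0, 0.2, 0.0], [0.1, 3.0, 0.0]])
labels, sims = classify_patches(patches, prototypes)
```

Because both sides are normalized, the matrix product yields cosine similarities directly, and the encoders stay frozen: only the text prompts change when classes are added or reworded.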
Problem

Research questions and friction points this paper is trying to address.

Zero-shot segmentation of skin tumors in gigapixel whole-slide images
Automated tumor delineation without pixel-level labels using vision-language models
Addressing fine-grained segmentation challenges across diverse histological patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot segmentation using vision-language foundation models
Automated pipeline with textual prompt ensembles
Generating tumor masks via cosine similarity computation
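
How per-patch similarity scores from overlapping patches might be merged into a mask can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that scores are averaged wherever patches overlap and that the mask lives on a downsampled grid rather than at full gigapixel resolution. The function `aggregate_patch_scores` and all of its parameters are hypothetical.

```python
import numpy as np

def aggregate_patch_scores(scores, coords, patch_size, grid_shape):
    """Average per-patch class scores over overlapping windows.

    scores: (n_patches, n_classes) similarity scores.
    coords: (n_patches, 2) top-left (row, col) of each patch on the grid.
    Returns an (H, W) label map; cells that no patch covers are set to -1.
    """
    h, w = grid_shape
    n_classes = scores.shape[1]
    acc = np.zeros((n_classes, h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    for (r, c), s in zip(coords, scores):
        r1, c1 = min(r + patch_size, h), min(c + patch_size, w)
        acc[:, r:r1, c:c1] += s[:, None, None]  # add this patch's scores to its window
        count[r:r1, c:c1] += 1                  # track how many patches cover each cell
    mean = acc / np.maximum(count, 1)           # avoid division by zero on uncovered cells
    mask = mean.argmax(axis=0)
    mask[count == 0] = -1
    return mask

# Two 4x4 patches with stride 2 on a 4x6 grid (hypothetical scores).
coords = np.array([[0, 0], [0, 2]])
scores = np.array([[0.9, 0.1],   # patch 0 favors class 0
                   [0.2, 0.8]])  # patch 1 favors class 1
mask = aggregate_patch_scores(scores, coords, patch_size=4, grid_shape=(4, 6))
```

In the overlap (columns 2 and 3) the averaged scores are 0.55 for class 0 and 0.45 for class 1, so those cells take class 0, while the columns covered only by the second patch take class 1.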
Santiago Moreno
Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-Tech), Universitat Politècnica de València, Valencia, Spain
Pablo Meseguer
Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-Tech), Universitat Politècnica de València, Valencia, Spain
Rocío del Amor
Universidad Politécnica de Valencia
Artificial Intelligence, Computer Vision
Valery Naranjo
Universitat Politècnica de València
image processing, video processing, deep learning, machine learning, histological image processing