Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

📅 2026-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses performance degradation in few-shot whole-slide image classification within multiple instance learning (MIL) frameworks, where randomly initialized linear classifiers often underperform zero-shot predictions. To bridge this gap, the authors propose ZS-MIL, which initializes the MIL classifier's weights with the semantic class embeddings produced by the text encoder of a vision-language model. This strategy aligns few-shot learning with zero-shot semantic priors and outperforms conventional random initialization across multiple histopathological subtype classification tasks. Beyond improving classification accuracy, the method also stabilizes training, consistently narrowing the performance gap between zero-shot and few-shot settings.

📝 Abstract
Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door to supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require multiple instance learning (MIL) due to the gigapixel size of WSIs. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on linear-probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques, both in performance and variability, in an ETL few-shot scenario for subtype prediction.
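The core idea described in the abstract, initializing the MIL head's linear classifier with the VLM text encoder's class embeddings rather than random weights, can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the function name and the stand-in random embeddings are assumptions, and in practice `class_text_embeds` would come from a pathology VLM's text encoder given class-name prompts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_zs_initialized_classifier(class_text_embeds: torch.Tensor) -> nn.Linear:
    """Build a linear head whose weights start at the L2-normalized
    class-level text embeddings, so few-shot training begins from the
    zero-shot solution instead of a random point (hypothetical sketch)."""
    num_classes, embed_dim = class_text_embeds.shape
    head = nn.Linear(embed_dim, num_classes, bias=False)
    with torch.no_grad():
        # nn.Linear stores weight as (out_features, in_features),
        # which matches (num_classes, embed_dim) directly.
        head.weight.copy_(F.normalize(class_text_embeds, dim=-1))
    return head

# Stand-in for text-encoder outputs: 3 subtype classes, 512-dim embeddings.
embeds = F.normalize(torch.randn(3, 512), dim=-1)
head = make_zs_initialized_classifier(embeds)

# bag_feature stands in for the slide-level aggregated MIL feature.
bag_feature = F.normalize(torch.randn(512), dim=-1)
probs = head(bag_feature).softmax(dim=-1)  # bag-level class probabilities
```

Before any gradient step, this head reproduces zero-shot prediction (cosine similarity between the bag feature and each class embedding); fine-tuning on the few labeled slides then refines it rather than starting from scratch.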
Problem

Research questions and friction points this paper is trying to address.

few-shot learning
multiple instance learning
classifier initialization
vision-language models
histopathological image classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Multiple-Instance Learning
Vision-Language Models
Few-Shot Adaptation
Classifier Initialization
Histopathological Image Classification
Pablo Meseguer
Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-Tech), Universitat Politècnica de València, Valencia, Spain
Rocío del Amor
Universidad Politécnica de Valencia
Artificial Intelligence, Computer Vision
Valery Naranjo
Universitat Politècnica de València
image processing, video processing, deep learning, machine learning, histological image processing