Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

📅 2026-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses performance degradation in few-shot whole-slide image classification within multiple instance learning (MIL) frameworks, where randomly initialized linear classifiers often underperform zero-shot predictions. To bridge this gap, the authors propose ZS-MIL, which initializes the MIL classifier's weights with the semantic class embeddings produced by the text encoder of a vision-language model. This strategy aligns few-shot learning with zero-shot semantic priors and outperforms conventional random initialization across multiple histopathological subtype classification tasks. Beyond improving classification accuracy, the method also stabilizes training, consistently narrowing the performance gap between zero-shot and few-shot settings.

📝 Abstract
Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door to supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require multiple instance learning (MIL) due to the gigapixel size of WSIs. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on linear-probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques, both in performance and variability, in an ETL few-shot scenario for subtype prediction.
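The core idea described in the abstract, initializing the MIL head's linear classifier with the VLM text encoder's class embeddings rather than random weights, can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the function name and the stand-in random embeddings are assumptions, and in practice `class_text_embeds` would come from a pathology VLM's text encoder given class-name prompts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_zs_initialized_classifier(class_text_embeds: torch.Tensor) -> nn.Linear:
    """Build a linear head whose weights start at the L2-normalized
    class-level text embeddings, so few-shot training begins from the
    zero-shot solution instead of a random point (hypothetical sketch)."""
    num_classes, embed_dim = class_text_embeds.shape
    head = nn.Linear(embed_dim, num_classes, bias=False)
    with torch.no_grad():
        # nn.Linear stores weight as (out_features, in_features),
        # which matches (num_classes, embed_dim) directly.
        head.weight.copy_(F.normalize(class_text_embeds, dim=-1))
    return head

# Stand-in for text-encoder outputs: 3 subtype classes, 512-dim embeddings.
embeds = F.normalize(torch.randn(3, 512), dim=-1)
head = make_zs_initialized_classifier(embeds)

# bag_feature stands in for the slide-level aggregated MIL feature.
bag_feature = F.normalize(torch.randn(512), dim=-1)
probs = head(bag_feature).softmax(dim=-1)  # bag-level class probabilities
```

Before any gradient step, this head reproduces zero-shot prediction (cosine similarity between the bag feature and each class embedding); fine-tuning on the few labeled slides then refines it rather than starting from scratch.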
Problem

Research questions and friction points this paper is trying to address.

few-shot learning
multiple instance learning
classifier initialization
vision-language models
histopathological image classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Multiple-Instance Learning
Vision-Language Models
Few-Shot Adaptation
Classifier Initialization
Histopathological Image Classification
Pablo Meseguer
Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-Tech), Universitat Politècnica de València, Valencia, Spain
Rocío del Amor
Universidad Politécnica de Valencia
Artificial Intelligence, Computer Vision
Valery Naranjo
Universitat Politècnica de València
image processing, video processing, deep learning, machine learning, histological image processing