🤖 AI Summary
This work addresses the performance degradation in few-shot whole-slide image classification within multiple instance learning (MIL) frameworks, where randomly initialized linear classifiers often underperform zero-shot predictions. To bridge this gap, the authors propose ZS-MIL, a novel approach that leverages semantic class embeddings generated by the text encoder of a vision-language model to initialize the MIL classifier’s weights. This strategy effectively aligns few-shot learning with zero-shot semantic priors, significantly outperforming conventional random initialization across multiple histopathological subtype classification tasks. The proposed method not only improves classification accuracy but also enhances model stability, demonstrating a consistent and meaningful reduction in the performance disparity between zero-shot and few-shot settings.
📝 Abstract
Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door to supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require multiple instance learning (MIL) due to the gigapixel size of WSIs. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on linear-probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques, in terms of both performance and variability, in an ETL few-shot scenario for subtype prediction.
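The core idea — initializing the MIL classifier's weight matrix from the VLM text encoder's class embeddings so that, before any training, the classifier reproduces zero-shot predictions — can be illustrated with a minimal NumPy sketch. The function names, embedding dimensions, and temperature value below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def zero_shot_init(class_text_embeddings):
    # Hypothetical sketch: L2-normalize each class's text embedding and use
    # the result as the initial weight matrix of the linear classifier,
    # shape (num_classes, embed_dim).
    norms = np.linalg.norm(class_text_embeddings, axis=1, keepdims=True)
    return class_text_embeddings / norms

def bag_logits(slide_feature, W, temperature=0.07):
    # Bag-level logits as scaled cosine similarity between the aggregated
    # slide-level feature and each class weight vector; the temperature
    # value here is an assumption.
    f = slide_feature / np.linalg.norm(slide_feature)
    return W @ f / temperature

# Toy example: 3 subtype classes, 8-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 8))   # stand-in for VLM text-encoder outputs
W = zero_shot_init(text_emb)
slide = rng.normal(size=8)           # stand-in for the aggregated WSI feature
logits = bag_logits(slide, W)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # bag-level class probabilities
```

At initialization this layer behaves like the zero-shot classifier; few-shot fine-tuning then updates `W` from this semantic starting point rather than from random noise.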