🤖 AI Summary
Clinical deployment of AI in oncology is hindered by scarcity of annotated data and high computational costs associated with model retraining. Method: This work presents the first systematic evaluation of multimodal vision-language models (VLMs)—including Paligemma, CLIP, ALIGN, and GPT-4o—in few-shot in-context learning (ICL) for tumor pathology image diagnosis, without parameter updates or fine-tuning. Experiments are conducted across multiple real-world tumor pathology datasets under zero-shot and few-shot ICL settings. Contribution/Results: GPT-4o achieves F1 scores of 0.81 (binary classification) and 0.60 (multiclass classification); notably, lightweight open-source VLMs (e.g., Paligemma) attain comparable performance, underscoring their viability in low-resource clinical settings. The study demonstrates that ICL enables expert-level tumor classification using only a handful of annotated examples—eliminating the need for retraining—and establishes an efficient, lightweight, and generalizable paradigm for diagnosing rare cancers.
📝 Abstract
The application of AI in oncology has been limited by its reliance on large, annotated datasets and the need for retraining models for domain-specific diagnostic tasks. Taking heed of these limitations, we investigated in-context learning as a pragmatic alternative to model retraining by allowing models to adapt to new diagnostic tasks using only a few labeled examples at inference, without the need for retraining. Using four vision-language models (VLMs)-Paligemma, CLIP, ALIGN and GPT-4o, we evaluated the performance across three oncology datasets: MHIST, PatchCamelyon and HAM10000. To the best of our knowledge, this is the first study to compare the performance of multiple VLMs on different oncology classification tasks. Without any parameter updates, all models showed significant gains with few-shot prompting, with GPT-4o reaching an F1 score of 0.81 in binary classification and 0.60 in multi-class classification settings. While these results remain below the ceiling of fully fine-tuned systems, they highlight the potential of ICL to approximate task-specific behavior using only a handful of examples, reflecting how clinicians often reason from prior cases. Notably, open-source models like Paligemma and CLIP demonstrated competitive gains despite their smaller size, suggesting feasibility for deployment in computing constrained clinical environments. Overall, these findings highlight the potential of ICL as a practical solution in oncology, particularly for rare cancers and resource-limited contexts where fine-tuning is infeasible and annotated data is difficult to obtain.