🤖 AI Summary
This study investigates large language models’ (LLMs’) ability to comprehend fine-grained immunotherapeutic biomarkers—such as PD-L1 and tumor mutational burden (TMB)—in breast cancer, using oncologist-annotated labels as the gold standard. Method: We systematically evaluate BERT, BioBERT, LLaMA, and GPT-series models on an expert-curated dataset of breast cancer literature abstracts, employing zero-shot and few-shot prompting, embedding similarity analysis, and concept activation mapping. Contribution/Results: Domain-adapted smaller models—particularly BioBERT—achieve 78% accuracy, significantly outperforming GPT-4 (62%) and LLaMA-3 (59%). This constitutes the first empirical evidence that pre-trained compact models can surpass general-purpose LLMs in specific clinical concept recognition. Based on these findings, we propose ImmunoFOMO, a clinical-cognitive alignment evaluation framework that challenges the prevailing “scale-driven performance” paradigm in medical AI, underscoring the critical importance of domain knowledge integration and task-specific alignment.
📝 Abstract
The capabilities of language models (LMs) have grown rapidly over the past decade, leading researchers in various disciplines, such as biomedical research, to increasingly explore the utility of LMs in their day-to-day applications. Domain-specific language models are already in use for biomedical natural language processing (NLP) applications. Recently, however, interest has grown in medical language models and their understanding capabilities. In this paper, we investigate the medical conceptual grounding of various language models against expert clinicians for the identification of hallmarks of immunotherapy in breast cancer abstracts. Our results show that pre-trained language models have the potential to outperform large language models in identifying very specific (low-level) concepts.