AI Summary
This study investigates large language models' (LLMs') ability to comprehend fine-grained immunotherapeutic biomarkers, such as PD-L1 and tumor mutational burden (TMB), in breast cancer, using oncologist-annotated labels as the gold standard. Method: We systematically evaluate BERT, BioBERT, LLaMA, and GPT-series models on an expert-curated dataset of breast cancer literature abstracts, employing zero-shot and few-shot prompting, embedding similarity analysis, and concept activation mapping. Contribution/Results: Domain-adapted smaller models, particularly BioBERT, achieve 78% accuracy, significantly outperforming GPT-4 (62%) and LLaMA-3 (59%). This constitutes the first empirical evidence that pre-trained compact models can surpass general-purpose LLMs in specific clinical concept recognition. Based on these findings, we propose ImmunoFOMO, a clinical-cognitive alignment evaluation framework that challenges the prevailing "scale-driven performance" paradigm in medical AI, underscoring the critical importance of domain knowledge integration and task-specific alignment.
Abstract
The capabilities of language models (LMs) have grown at a fast pace over the past decade, leading researchers in various disciplines, such as biomedical research, to increasingly explore the utility of LMs in their day-to-day applications. Domain-specific language models are already in use for biomedical natural language processing (NLP) applications. Recently, however, interest has grown in medical language models and their understanding capabilities. In this paper, we investigate the medical conceptual grounding of various language models against expert clinicians for the identification of hallmarks of immunotherapy in breast cancer abstracts. Our results show that pre-trained language models have the potential to outperform large language models in identifying very specific (low-level) concepts.