🤖 AI Summary
The zero-shot performance of large language models (LLMs) on biomedical information extraction—specifically medical text classification and named entity recognition—remains poorly understood, particularly regarding structured output generation and the efficacy of advanced prompting techniques. Method: We conduct a systematic evaluation comparing standard prompting, chain-of-thought (CoT), self-consistency, and retrieval-augmented generation (RAG) grounded in PubMed and Wikipedia, while analyzing the impact of task-specific knowledge, parametric domain knowledge, and external knowledge integration. Contribution/Results: Standard prompting significantly outperforms CoT, self-consistency, and RAG across multiple biomedical benchmarks—revealing, for the first time, that general-purpose reasoning techniques are ill-suited for high-precision structured extraction in this domain. Our findings provide critical empirical evidence and methodological insights for deploying LLMs in biomedical applications, challenging assumptions about the universal applicability of advanced prompting strategies.
📝 Abstract
Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs' task knowledge and reasoning capabilities, their (parametric) domain knowledge, and the addition of external knowledge. To this end, we evaluate various open LLMs - including BioMistral and Llama-2 models - on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency-based reasoning, as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counterintuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations of the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.
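To make the compared setups concrete, the sketch below illustrates (under assumptions, since the paper's exact templates are not shown here) how a standard prompt and a CoT prompt for biomedical NER might differ, and how self-consistency can aggregate multiple sampled entity lists by majority vote. The task wording, the disease-extraction example, and the voting threshold are illustrative choices, not the authors' implementation.

```python
from collections import Counter

def standard_prompt(text: str) -> str:
    # Standard prompting: directly request the structured output.
    return (
        "Extract all disease mentions from the text below. "
        "Return them as a JSON list of strings.\n"
        f"Text: {text}\nEntities:"
    )

def cot_prompt(text: str) -> str:
    # Chain-of-Thought: ask the model to reason before answering.
    return (
        "Extract all disease mentions from the text below. "
        "First reason step by step about candidate spans, "
        "then return the final answer as a JSON list of strings.\n"
        f"Text: {text}\nReasoning:"
    )

def self_consistency(sampled_entity_lists: list[list[str]]) -> list[str]:
    # Self-consistency: sample several generations (e.g. with temperature > 0)
    # and keep entities predicted by a majority of the samples.
    votes = Counter()
    for entities in sampled_entity_lists:
        votes.update(set(entities))  # count each entity once per sample
    threshold = len(sampled_entity_lists) / 2
    return sorted(e for e, count in votes.items() if count > threshold)
```

For example, three sampled generations `[["asthma", "COPD"], ["asthma"], ["asthma", "COPD"]]` would majority-vote to `["COPD", "asthma"]`. The paper's finding is that such layered machinery tends to underperform the plain `standard_prompt` variant on these structured-extraction benchmarks.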