LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

📅 2024-08-22
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Background: The zero-shot performance of large language models (LLMs) on biomedical information extraction—specifically medical text classification and named entity recognition—remains poorly understood, particularly regarding structured output generation and the efficacy of advanced prompting techniques.
Method: We conduct a systematic evaluation comparing standard prompting, chain-of-thought (CoT), self-consistency, and retrieval-augmented generation (RAG) grounded in PubMed and Wikipedia, while analyzing the impact of task-specific knowledge, parametric domain knowledge, and external knowledge integration.
Contribution/Results: Standard prompting significantly outperforms CoT, self-consistency, and RAG across multiple biomedical benchmarks—revealing that general-purpose reasoning techniques are ill-suited for high-precision structured extraction in this domain. Our findings provide critical empirical evidence and methodological insights for deploying LLMs in biomedical applications, challenging assumptions about the universal applicability of advanced prompting strategies.

📝 Abstract
Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance on Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to performance, particularly the impact of LLMs' task knowledge and reasoning capabilities, their (parametric) domain knowledge, and the addition of external knowledge. To this end, we evaluate various open LLMs - including BioMistral and Llama-2 models - on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency based reasoning, as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counterintuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations of the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.
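Of the techniques compared in the abstract, self-consistency is the simplest to illustrate: sample several completions at non-zero temperature and keep the majority-vote answer. A minimal sketch follows; `toy_generate` is a hypothetical deterministic stand-in for the model sampler (the paper's setup would instead sample from a model such as Llama-2 or BioMistral), and the entity labels are illustrative only.

```python
from collections import Counter

def self_consistency_vote(prompt, generate, n_samples=5):
    """Sample several completions for one prompt and return the
    majority-vote answer (the core of self-consistency decoding)."""
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for an LLM sampler: returns canned answers,
# mimicking the variation of temperature > 0 sampling.
_canned = iter(["DISEASE", "CHEMICAL", "DISEASE", "DISEASE", "CHEMICAL"])

def toy_generate(prompt):
    return next(_canned)

label = self_consistency_vote(
    "Classify the entity 'pneumonia' as DISEASE or CHEMICAL:",
    toy_generate,
    n_samples=5,
)
print(label)  # → DISEASE (3 of the 5 sampled answers agree)
```

The paper's finding is that even this aggregation step does not help on structured biomedical extraction: a single greedy (standard-prompting) answer outperforms the vote.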
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' performance on biomedical structured information extraction tasks
Evaluating impact of task knowledge, domain knowledge, and external knowledge on LLMs
Testing advanced prompting methods for biomedical precision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs on Medical Classification and NER
Evaluating standard prompting versus CoT, self-consistency, and RAG
Highlighting limitations of advanced prompting in biomedicine