HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how large language models (LLMs) utilize and resist interference from contextual information—particularly contradictory biomedical knowledge—in health-related question answering. To this end, we introduce HealthContradict, the first expert-annotated benchmark dataset (920 instances) for systematically evaluating model reasoning under correct, incorrect, and contradictory contexts. Using multi-prompting strategies, we combine scientifically grounded factual answers with opposing-stance documents to quantify how models integrate parametric knowledge with contextual evidence. Experimental results show that fine-tuned biomedical LLMs effectively leverage accurate context while suppressing interference from erroneous information; moreover, substantial inter-model variation exists in contextual sensitivity. This work provides the first systematic analysis of LLM reasoning mechanisms under information conflict in healthcare, establishing a rigorous evaluation benchmark and actionable insights for developing trustworthy medical AI systems.

📝 Abstract
How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect, or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models' contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.
Problem

Research questions and friction points this paper is trying to address.

Evaluates language models' use of conflicting biomedical contexts
Assesses reasoning over contradictory health-related documents
Measures impact of correct versus incorrect context on responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates models with expert-verified contradictory biomedical contexts
Tests prompt settings with correct, incorrect, and contradictory contexts
Shows models exploit correct context while resisting incorrect context
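The prompt settings described above can be sketched as follows. This is a minimal illustration of how the three context conditions (correct, incorrect, and contradictory) might be assembled from one dataset instance; the field names and prompt template are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of the three context conditions evaluated in
# HealthContradict. Field names ("question", "supporting_doc",
# "opposing_doc") are assumptions, not the dataset's real schema.

def build_prompts(instance):
    """Pair the question with correct-only, incorrect-only,
    and contradictory (both documents) contexts."""
    q = instance["question"]
    correct = instance["supporting_doc"]    # agrees with the factual answer
    incorrect = instance["opposing_doc"]    # contradicts the factual answer
    template = "Context:\n{ctx}\n\nQuestion: {q}\nAnswer:"
    return {
        "correct": template.format(ctx=correct, q=q),
        "incorrect": template.format(ctx=incorrect, q=q),
        # Contradictory setting: both stances appear in the same context.
        "contradictory": template.format(ctx=correct + "\n\n" + incorrect, q=q),
    }

# Toy instance (invented content, for illustration only).
example = {
    "question": "Does vitamin C prevent the common cold?",
    "supporting_doc": "Trials show no consistent preventive effect of vitamin C.",
    "opposing_doc": "Vitamin C reliably prevents the common cold.",
}
prompts = build_prompts(example)
```

Measuring answer accuracy under each condition separately is what lets the benchmark distinguish models that merely recall parametric knowledge from those that exploit correct context while resisting incorrect context.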
Boya Zhang
Lawrence Livermore National Laboratory
Design of Experiments · Gaussian processes · Active learning
A. Bornet
Faculty of Medicine, University of Geneva, Geneva, Switzerland
Rui Yang
Department of Biomedical Informatics, Yong Loo Lin School of Medicine, National University of Singapore
Nan Liu
Department of Biostatistics & Bioinformatics, Duke University; Artificial Intelligence Institute, National University of Singapore
Douglas Teodoro
Professor, University of Geneva
biomedical NLP · machine learning for healthcare · medical informatics