Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies a novel “gender-assignment bias” in large language models (LLMs) deployed in clinical contexts: when prompted to assess whether a patient’s gender is clinically relevant to a diagnosis, LLMs exhibit significant inconsistency—systematically differing in judgments between male and female cases—while maintaining stable diagnostic outputs. Using real-world case reports from *The New England Journal of Medicine*, the authors conduct multi-round gender-swapping experiments to rigorously evaluate consistency across leading open- and closed-weight LLMs. The work provides the first empirical evidence that implicit bias manifests specifically at the level of relevance attribution—not diagnosis—thereby undermining the reliability of LLMs in clinical decision support. Crucially, this bias is orthogonal to diagnostic accuracy, revealing a previously overlooked dimension of fairness in medical AI. The study introduces a reproducible methodological framework for evaluating fairness in LLM-based clinical tools, advancing both technical assessment protocols and equity-aware deployment standards.

📝 Abstract
The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising the possibility that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine (NEJM) Challenge, we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender's relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgments. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether LLM diagnoses remain consistent when assigned genders change
Assessing inconsistency across LLMs in judging the clinical relevance of patient gender
Identifying systematic female-male disparities in how clinical AI models interpret patient gender
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated gender bias in LLMs using NEJM case studies
Assigned different genders to LLMs for response analysis
Measured consistency in diagnosis and gender relevance judgments
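The swap-and-compare protocol summarized above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: `swap_gender`, `consistency_rate`, and the toy judge functions are hypothetical names, and a real evaluation would query the LLMs under test across multiple rounds rather than use the deterministic judges shown here.

```python
import re

# Gendered-term substitutions applied to a case description.
GENDER_SWAPS = {
    "female": "male", "male": "female",
    "woman": "man", "man": "woman",
    "she": "he", "he": "she",
    "her": "his", "his": "her",
}

def swap_gender(text: str) -> str:
    """Swap gendered terms on word boundaries, preserving capitalization."""
    def repl(m: re.Match) -> str:
        word = m.group(0)
        swapped = GENDER_SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = re.compile(r"\b(" + "|".join(GENDER_SWAPS) + r")\b", re.IGNORECASE)
    return pattern.sub(repl, text)

def consistency_rate(cases, judge) -> float:
    """Fraction of cases where the judgment (e.g. 'is patient gender
    clinically relevant here?') is identical for the original and the
    gender-swapped version of the case. `judge` stands in for an LLM call."""
    agree = sum(judge(c) == judge(swap_gender(c)) for c in cases)
    return agree / len(cases)
```

A gender-blind judge scores a consistency rate of 1.0, while a judge whose answer flips with the patient's gender scores lower; the paper's finding is, in these terms, that relevance judgments scored far below diagnosis judgments.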
Mingxuan Liu
Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
Yuhe Ke
Department of Anaesthesiology and Perioperative Medicine, Singapore General Hospital, Singapore
Wentao Zhu
Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
Mayli Mertens
Centre for Ethics, Department of Philosophy, University of Antwerp, Antwerp, Belgium
Yilin Ning
Senior Research Fellow, Centre for Quantitative Medicine, Duke-NUS Medical School
Jingchi Liao
Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
Chuan Hong
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
Daniel Shu Wei Ting
Artificial Intelligence Office, SingHealth, Singapore
Yifan Peng
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
Danielle S. Bitterman
Harvard Medical School
Marcus Eng Hock Ong
Pre-hospital & Emergency Research Centre, Health Services Research & Population Health, Duke-NUS Medical School, Singapore
Nan Liu
Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore