NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

📅 2025-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of evaluating how well large language models (LLMs) align with core nursing values in clinical settings. It introduces NurValues, the first benchmark for nursing value alignment, grounded in international nursing codes and structured along five dimensions: altruism, human dignity, integrity, justice, and professionalism. The benchmark combines 1,100 real-world nursing behavior instances, collected from three hospitals of varying tiers, with LLM-generated counterfactuals of reversed ethical polarity, giving 2,200 labeled instances (the Easy-Level dataset). Each instance is then rewritten as a dialogue that embeds contextual cues and subtle misleading signals, giving an adversarial Hard-Level dataset. Across 23 state-of-the-art models, DeepSeek-V3 scores highest on the Easy-Level dataset (94.55) and Claude 3.5 Sonnet scores highest on the Hard-Level dataset (89.43); justice is the most difficult dimension, and in-context learning (ICL) significantly improves alignment.

📝 Abstract

This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The benchmark comprises 1,100 real-world nursing behavior instances collected through a five-month longitudinal field study across three hospitals of varying tiers. These instances are annotated by five clinical nurses and then augmented with LLM-generated counterfactuals with reversed ethical polarity. Each original case is paired with a value-aligned and a value-violating version, resulting in 2,200 labeled instances that constitute the Easy-Level dataset. To increase adversarial complexity, each instance is further transformed into a dialogue-based format that embeds contextual cues and subtle misleading signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA) LLMs on their alignment with nursing values. Our findings reveal three key insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level dataset (94.55), whereas Claude 3.5 Sonnet outperforms other models on the Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2) Justice is consistently the most difficult nursing value dimension to evaluate; and (3) in-context learning significantly improves alignment. This work aims to provide a foundation for value-sensitive LLM development in clinical settings. The dataset and the code are available at https://huggingface.co/datasets/Ben012345/NurValues.
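
The pairing step behind the Easy-Level dataset can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Instance` class, its field names, and the `build_easy_level` helper are hypothetical, and the counterfactual texts are assumed to have been generated beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # All field names here are hypothetical, chosen for illustration only.
    case_id: int      # index of the original field-study case
    text: str         # description of the nursing behavior
    dimension: str    # one of the five value dimensions
    aligned: bool     # True = value-aligned, False = value-violating


def build_easy_level(originals, counterfactual_texts):
    """Pair each original instance with a counterfactual of reversed polarity.

    `originals` is a list of Instance objects; `counterfactual_texts` holds
    one rewritten text per original, in the same order.
    """
    if len(originals) != len(counterfactual_texts):
        raise ValueError("need exactly one counterfactual per original")
    dataset = []
    for original, flipped_text in zip(originals, counterfactual_texts):
        dataset.append(original)
        # Same case and dimension, opposite label.
        dataset.append(
            Instance(original.case_id, flipped_text,
                     original.dimension, not original.aligned)
        )
    return dataset
```

Under this scheme, 1,100 originals yield 2,200 instances with one aligned and one violating version per case, which matches the counts reported in the abstract.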

Problem

Research questions and friction points this paper is trying to address.

Evaluating nursing value alignment in LLMs using real-world clinical data

Assessing five core nursing values across 2,200 labeled instances

Measuring adversarial robustness via dialogue-based value violation tests
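
A per-dimension score of the kind needed to compare the five values could be computed along these lines. This is a sketch under stated assumptions: the source does not say which metric produces the reported scores, so plain accuracy is used here as a stand-in, and the record keys `dimension` and `label` are hypothetical rather than the dataset's actual schema.

```python
from collections import defaultdict


def accuracy_by_dimension(instances, predictions):
    """Return {dimension: fraction of correct predictions}.

    `instances` is a list of dicts with hypothetical keys "dimension" and
    "label" (1 = value-aligned, 0 = value-violating); `predictions` is a
    list of 0/1 model outputs in the same order.
    """
    if len(instances) != len(predictions):
        raise ValueError("instances and predictions must have equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst["dimension"]] += 1
        correct[inst["dimension"]] += int(pred == inst["label"])
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data for illustration only; not taken from the benchmark.
toy = [
    {"dimension": "Justice", "label": 1},
    {"dimension": "Justice", "label": 0},
    {"dimension": "Integrity", "label": 1},
]
scores = accuracy_by_dimension(toy, [1, 1, 1])
```

Running the same function on Easy-Level and Hard-Level predictions separately would show how much each dimension degrades under the dialogue-based transformation.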

Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for nursing value alignment

Collection of real-world nursing behavior instances through a field study

Dialogue-based adversarial dataset transformation
