🤖 AI Summary
Existing benchmarks do not adequately evaluate three capabilities that large language models (LLMs) need for patient safety event triage: evidence-grounded policy reasoning, proactive completion of missing information, and principled abstention under uncertainty. This work proposes PSEBench, the first agent environment enabling auditable and scalable evaluation of these capabilities. It structures policy rules as clause cards, integrates anchor-driven narrative generation with a closed-loop verification mechanism, and constructs a dataset of 5,074 synthetically generated cases with ground-truth labels, including variants that explicitly simulate missing information and uncertainty. Evaluation across 15 mainstream LLMs demonstrates the benchmark’s validity and reveals critical deficiencies in current models’ ability to abstain when uncertain and to actively seek necessary information, highlighting key directions for improvement.
📝 Abstract
Patient safety event triage, which determines whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks that capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. By combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing-information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
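To make the clause-card idea concrete, the following is a minimal Python sketch of how a policy clause might be factorized into checkable conditions, and how the three behaviors named above (deciding, seeking information, abstaining) could follow from which conditions are known. Everything here is an assumption made for illustration: the `ClauseCard` fields, the `Decision` labels, the `triage` function, and the example clause are hypothetical and are not taken from PSEBench or from Minnesota policy text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Optional, Tuple


class Decision(Enum):
    REPORTABLE = "reportable"
    NOT_REPORTABLE = "not_reportable"
    ASK = "ask_for_information"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class ClauseCard:
    """Hypothetical clause card: one policy clause as a conjunction of conditions."""
    clause_id: str
    required: Tuple[str, ...]  # condition names that must all hold


def triage(card: ClauseCard,
           facts: Dict[str, Optional[bool]],
           unobtainable: FrozenSet[str] = frozenset()) -> Decision:
    """Decide from known facts; a missing or None fact counts as unknown."""
    # One failed condition settles the conjunction, whatever else is unknown.
    if any(facts.get(name) is False for name in card.required):
        return Decision.NOT_REPORTABLE
    unknown = [name for name in card.required if facts.get(name) is None]
    if not unknown:
        return Decision.REPORTABLE
    # Abstain only when no unknown condition could be resolved by asking.
    if all(name in unobtainable for name in unknown):
        return Decision.ABSTAIN
    return Decision.ASK


# Invented example clause, not actual policy wording.
card = ClauseCard(
    clause_id="EXAMPLE-1",
    required=("occurred_in_covered_facility", "resulted_in_serious_injury"),
)
incomplete_report = {"occurred_in_covered_facility": True}
print(triage(card, incomplete_report).value)
```

In this sketch, an incomplete report leads to a request for information, while the same report with the missing condition marked as unobtainable leads to abstention. This mirrors the distinction the abstract draws between incomplete reports and irreducibly ambiguous cases.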