🤖 AI Summary
Problem: Research on emergency triage and mass casualty incident (MCI) triage is hindered by the absence of publicly available, reproducible benchmark datasets. Method: We introduce the first open-source, large language model (LLM)-assisted emergency triage benchmark for clinical deterioration prediction, covering both routine hospital emergency department (ED) settings and MCI field simulations. We use LLMs to standardize clinical narratives, align heterogeneous multi-source tables, harmonize noisy fields, and prioritize critical features, which improves data reproducibility and accessibility. We construct a structured triage dataset derived from MIMIC-IV-ED, add SHAP-based interpretability analysis, and release multiple baseline models. Results: Experiments reveal performance gaps between the two scenarios and identify core triage indicators, including heart rate, systolic blood pressure, and level of consciousness, as a step toward more accessible clinical AI datasets and data-driven triage.
📝 Abstract
Research on emergency and mass casualty incident (MCI) triage has been limited by the absence of openly usable, reproducible benchmarks. Yet these scenarios demand rapid identification of the patients most in need, and accurate deterioration prediction can guide timely interventions. While the MIMIC-IV-ED database is openly available to credentialed researchers, transforming it into a triage-focused benchmark requires extensive preprocessing, feature harmonization, and schema alignment -- barriers that restrict access to highly technical users.
We address these gaps by introducing an open, large language model (LLM)-assisted emergency triage benchmark for deterioration prediction (ICU transfer, in-hospital mortality). The benchmark defines two regimes: (i) a hospital-rich setting with vitals, labs, notes, chief complaints, and structured observations, and (ii) an MCI-like field simulation limited to vitals, observations, and notes. LLMs contributed directly to dataset construction by (a) harmonizing noisy fields such as AVPU and breathing devices, (b) prioritizing clinically relevant vitals and labs, and (c) guiding schema alignment and efficient merging of disparate tables.
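To make the two regimes and the field harmonization step concrete, the following is a minimal sketch rather than the benchmark's actual pipeline. The AVPU spelling variants, the column names (`heart_rate`, `systolic_bp`, `lactate`, and so on), and the helpers `harmonize_avpu` and `select_features` are hypothetical and chosen only for illustration; the feature groups per regime follow the description above.

```python
import re

# Hypothetical spelling variants for each AVPU level
# (Alert, Voice, Pain, Unresponsive); real data may need more patterns.
AVPU_PATTERNS = [
    ("A", re.compile(r"^(a|alert|awake)$")),
    ("V", re.compile(r"^(v|voice|verbal|responds to voice)$")),
    ("P", re.compile(r"^(p|pain|responds to pain)$")),
    ("U", re.compile(r"^(u|unresponsive|unresp)$")),
]


def harmonize_avpu(raw):
    """Map a noisy AVPU entry to 'A', 'V', 'P', 'U', or None if unrecognized."""
    if raw is None:
        return None
    # Lowercase, drop punctuation and digits, collapse repeated whitespace.
    text = re.sub(r"[^a-z ]", "", raw.strip().lower())
    text = re.sub(r"\s+", " ", text).strip()
    for label, pattern in AVPU_PATTERNS:
        if pattern.match(text):
            return label
    return None


# Feature groups available in each regime, as described in the abstract.
REGIME_GROUPS = {
    "hospital_rich": {"vitals", "labs", "notes", "chief_complaint", "observations"},
    "mci_field": {"vitals", "observations", "notes"},
}

# Hypothetical assignment of columns to feature groups.
COLUMN_GROUP = {
    "heart_rate": "vitals",
    "systolic_bp": "vitals",
    "avpu": "observations",
    "lactate": "labs",
    "chief_complaint": "chief_complaint",
    "triage_note": "notes",
}


def select_features(record, regime):
    """Keep only the columns whose feature group is allowed in the given regime."""
    allowed = REGIME_GROUPS[regime]
    return {k: v for k, v in record.items() if COLUMN_GROUP.get(k) in allowed}


record = {
    "heart_rate": 118,
    "systolic_bp": 92,
    "avpu": harmonize_avpu("Responds to Voice."),
    "lactate": 3.1,
    "chief_complaint": "shortness of breath",
    "triage_note": "pale, diaphoretic",
}
field_view = select_features(record, "mci_field")
```

In this sketch, `field_view` keeps the vitals, the harmonized AVPU value, and the note, while labs and the chief complaint are dropped, mirroring the restricted information available in the MCI-like regime.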
We further provide baseline models and SHAP-based interpretability analyses, illustrating predictive gaps between regimes and the features most critical for triage. Together, these contributions make triage prediction research more reproducible and accessible -- a step toward dataset democratization in clinical AI.